Nonnegative signal factorization with learnt instrument models for sound source separation in close-microphone recordings
© Carabias-Orti et al.; licensee Springer. 2013
Received: 29 June 2013
Accepted: 21 November 2013
Published: 13 December 2013
Close-microphone techniques are extensively employed in many live music recordings, allowing for interference rejection and reducing the amount of reverberation in the resulting instrument tracks. However, despite the use of directional microphones, the recorded tracks are not completely free from source interference, a problem which is commonly known as microphone leakage. While source separation methods are potentially a solution to this problem, few approaches take into account the huge amount of prior information available in this scenario. In fact, besides the special properties of close-microphone tracks, the knowledge on the number and type of instruments making up the mixture can also be successfully exploited for improved separation performance. In this paper, a nonnegative matrix factorization (NMF) method making use of all the above information is proposed. To this end, a set of instrument models are learnt from a training database and incorporated into a multichannel extension of the NMF algorithm. Several options to initialize the algorithm are suggested, exploring their performance in multiple music tracks and comparing the results to other state-of-the-art approaches.
Multitrack audio recording techniques are based on capturing and recording individual sound sources into multiple discrete audio channels. Once all the sound sources have been recorded, the individual tracks are processed and mixed down to a number of mixture channels that depends on the specific audio reproduction format. Multitrack recording techniques can be broadly classified into live recording and track-by-track recording techniques. In the latter type, the performers are individually recorded one after another, resulting in almost perfectly isolated instrument tracks. On the other hand, in live audio recordings, the source signals, which share the acoustic space, are all acquired simultaneously during the performance. This leads to the well-known microphone leakage problem: the sounds coming from the concurrent sources are picked up by microphones other than the ones intended for the specific sources. To address this issue in close-miking techniques, a directional microphone is placed relatively close to an instrument, reducing the interference from other sources and the effect of room reverberation. Other mechanical and signal processing devices, such as absorbing barriers or noise gates, are also employed by sound engineers to mitigate this problem, but they only solve it partially, being most effective when used with transient signals.
Sound source separation (SSS) techniques have been suggested as a potential solution for the microphone leakage problem in multitrack live recordings [3, 4]. In general, the aim of SSS is to recover each source signal from a set of audio mixtures. SSS techniques can be broadly divided into blind source separation (BSS) and informed source separation (ISS) algorithms. BSS methods are especially popular in the statistical signal processing and machine learning areas, where the term blind emphasizes that very little information about the sources or the mixing process is known a priori. Techniques such as principal component analysis (PCA), independent component analysis (ICA) or nonnegative matrix factorization (NMF) have been introduced both to reduce the dimensionality and to explain the whole data by a few meaningful elementary objects. In fact, many BSS approaches are closely related to ICA, where the sources are assumed to be statistically independent and non-Gaussian. Most of these approaches are oriented to the determined separation problem, i.e., the number of sources equals the number of mixture signals. When the number of sources is greater than the number of mixtures, the problem is said to be underdetermined, and the underlying assumptions usually involve the sparsity of the sources under some suitable representation such as the time-frequency domain [7, 8]. Moreover, the assumptions may also differ depending on the acoustic environment, leading to instantaneous or convolutive separation methods. Instantaneous mixing models (IMM) assume a mixing matrix made up of scalar coefficients while convolutive models are often based on the estimation of unmixing filters. When working in the frequency domain, the mixture can be assumed to be instantaneous at each frequency bin and standard approaches such as ICA can be applied by following a subband approach.
However, due to the ICA permutation ambiguity, an alignment procedure requiring some additional information is necessary to group the resulting components into estimated source signals. When the signal model is assumed to be nonnegative, NMF provides a meaningful structure of the audio data, which in this case is obtained from the magnitude or energy spectrograms. NMF methods have been shown to be especially useful for musical analysis tasks, including not only SSS but also others, such as automatic music transcription or acoustic space characterization. NMF is based on decomposing the spectrogram audio data into a sum of elementary spectral patterns with time-varying gains. While NMF was originally proposed in the context of monaural SSS, other extensions have been developed for dealing with multichannel audio mixtures. As a result, NMF approaches are progressively becoming a promising solution to multichannel SSS. However, spectral patterns learnt by NMF-based approaches are often hard to interpret and lack explicit semantics. To overcome this issue, many algorithms constrain the original NMF to obtain musically meaningful patterns, for example, by considering a parametric model. In this context, the spectral patterns can be described by harmonic combs [15–17], spectrally and/or temporally localized Gaussians [18, 19], or by using a source/filter model [20–22].
In contrast to BSS, ISS methods start from available prior information, which can take the form of specific information about the sources, the mixing process, or additional modalities. For example, an ISS method oriented to close-miking live music recordings could exploit the properties of this specific setup: each microphone signal contains one of the sources significantly enhanced over the others due to both the directional properties of the sensors and their placement. In this context, Kokkinis and Mourjopoulos showed that under a close-miking assumption, a relatively simple Wiener filter outperforms some convolutive BSS algorithms. However, more sophisticated methods making use of additional prior information can be developed by considering a supervised separation framework. For example, musical score information can be used if the score and audio are well aligned [24–27]. Spectral information can be considered by using instrument models when the instruments are known in advance [28, 29]. Other kinds of information, such as high-level musicological knowledge, have been recently introduced by Fuentes et al., using recent advances in shift-invariant analysis of musical data. Regarding factorization methods, an important issue to take into account is the initialization and constraint of the parameters. In this context, Hurmalainen et al. proposed a method for automatic adaptation of learnt clean speech source models to deal with noise in a speech separation and recognition task. Furthermore, Fitzgerald presented a framework that allows the user to interact with the tensor factorization method to improve the performance in an adaptive way. Finally, the prior information can be the sources themselves. This knowledge enables the computation of side information, which is small enough to be inaudibly embedded into the mixtures. At a decoding step, this small side information is used along with the mixtures to recover the sources.
Following this scheme, Liutkus et al. proposed a system coding approach that permits very reliable transmission of the sources with a small amount of side information.
In this paper, an informed NMF-based SSS method is presented to tackle the microphone leakage problem in multichannel close-microphone recordings. To this end, several assumptions are made about the mixing environment, affecting problem dimensionality, direct-to-reverberant sound ratio and available instrument priors. In this context, it is assumed that the number of source signals is equal to or less than the number of microphone signals, with each mixture signal having a predominant direct-sound source resulting from a close-miking recording setup. Therefore, since the predominant source is captured with a high direct-to-reverberant ratio, an instantaneous model can be reasonably assumed, significantly simplifying the separation task. Since the method is constrained to be nonnegative, a panning matrix is used to determine the mixing process. Moreover, instrument model priors are obtained by means of a learning stage using a training database. The usefulness of these models is twofold. On the one hand, they enable an accurate estimation of the panning matrix. On the other hand, they simplify the separation stage by reducing the factorization to the estimation of instrument time-varying gains.
The paper is structured as follows. Section 2 provides an overview of the proposed SSS system and describes the fundamentals of NMF-based separation and instrument modeling. Section 3 describes the proposed multichannel extension for informed NMF-based separation using learnt instrument models. Panning matrix estimation and NMF-based separation are described in detail, explaining how the output of an automatic music transcription stage is used to discriminate single-source time-frequency zones. Section 4 describes the experiments conducted by using several music pieces in a simulated close-microphone setup and evaluates the separation performance by using objective measures. Finally, Section 5 summarizes the conclusions of this work.
2 Model description and background
2.1 System overview
2.2 NMF background
where g_n(t) is the gain of basis function n at frame t, and b_n(f), n = 1,…,N, are the bases. Note that this approach holds under two different configurations:
Therefore, whenever the model in Equation 1 is chosen, either assumption (a) or (b) is supposed to hold, and the time-frequency (TF) representation considered is either the magnitude or power spectrogram for (a) or the power spectrogram only for (b).
where the time gains g_{j,n}(t) and the harmonic amplitudes a_{j,n}(h) are the parameters to be estimated.
2.3 Augmented NMF for parameter estimation
The main advantage of the MU in Equation 5 is that it ensures nonnegativity of all parameters provided that they are nonnegative at initialization.
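As an illustration of this property, the following minimal NumPy sketch applies the standard multiplicative updates for the beta-divergence to an unconstrained factorization X ≈ BG. It is not meant to reproduce the exact form of Equation 5: the function name `nmf_mu`, the random initialization, and the small constant `eps` are assumptions made for the example. Because each update multiplies the current value by a ratio of nonnegative terms, parameters that start nonnegative remain nonnegative.

```python
import numpy as np

def nmf_mu(X, n_bases, beta=1.3, n_iter=50, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix X (F x T) as B @ G.

    Hypothetical helper: standard beta-divergence multiplicative updates
    with unconstrained bases (no harmonic model).
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    B = rng.random((F, n_bases)) + 1e-3   # nonnegative initialization
    G = rng.random((n_bases, T)) + 1e-3
    for _ in range(n_iter):
        Xhat = B @ G + eps
        # Gains: multiply by the ratio of the negative and positive gradient parts.
        G *= (B.T @ (X * Xhat ** (beta - 2))) / (B.T @ Xhat ** (beta - 1) + eps)
        Xhat = B @ G + eps
        # Bases: same rule with the roles of B and G exchanged.
        B *= ((X * Xhat ** (beta - 2)) @ G.T) / (Xhat ** (beta - 1) @ G.T + eps)
    return B, G
```

The default values β = 1.3 and 50 iterations follow the settings reported in the experimental section; any other value of β can be passed explicitly.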
2.4 Instrument modeling
As demonstrated in , when appropriate training data are available, it is advantageous to learn the instrument-dependent bases in advance and fix them during the analysis of the signals. In fact, this approach has been shown to perform well when the conditions of the music scene do not differ too much between the training and the test data. Here, we have used an approach similar to . Specifically, the amplitudes a_{j,n}(h) of each note of a musical instrument are learnt in advance by using the Real World Computing (RWC) music database [48, 49] as a training database of solo instruments playing isolated notes (more details in Section 4.1). Let the ground-truth transcription of the training data be represented by R_{j,n}(t) as a binary time/frequency matrix for each instrument j. The frequency dimension represents the musical instrument digital interface (MIDI) scale and the time dimension t represents the frames. At the training stage, gains are initialized with R_{j,n}(t), which is known in advance for the training database. Thus, gains are set to unity for each pitch at those time frames where the instrument is active, while the rest of the gains are set to zero. Note that gains initialized to zero remain at zero because of the multiplicative update rules, and therefore each frame is represented only with the correct pitch.
The training procedure is summarized in Algorithm 1.
Algorithm 1 Instrument modeling algorithm
The training algorithm computes the basis functions b_{j,n}(f) required at the factorization stage for each instrument. Since these instrument-dependent basis functions b_{j,n}(f) are known and held fixed, the factorization of new signals of the same instrument can be reduced to the estimation of the gains g_{j,n}(t).
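The role of the binary transcription in the training stage can be illustrated with a short NumPy sketch. It is a simplification and not a transcription of Algorithm 1: the bases are left unconstrained instead of following the harmonic model, and the name `train_bases` and the constant `eps` are assumptions made for the example. The point shown is that gains initialized to zero through R stay at zero under multiplicative updates, so each training frame is explained only by the annotated pitch.

```python
import numpy as np

def train_bases(X, R, beta=1.3, n_iter=50, seed=0, eps=1e-12):
    """Learn note bases from a solo-instrument training spectrogram.

    X: (F, T) nonnegative spectrogram of isolated notes.
    R: (N, T) binary ground-truth transcription (1 where note n is active).
    Returns bases B (F, N) and gains G (N, T).
    """
    rng = np.random.default_rng(seed)
    F, N = X.shape[0], R.shape[0]
    B = rng.random((F, N)) + 1e-3
    G = R.astype(float).copy()            # unity where active, zero elsewhere
    for _ in range(n_iter):
        Xhat = B @ G + eps
        B *= ((X * Xhat ** (beta - 2)) @ G.T) / (Xhat ** (beta - 1) @ G.T + eps)
        Xhat = B @ G + eps
        # A zero gain multiplied by any finite ratio is still zero.
        G *= (B.T @ (X * Xhat ** (beta - 2))) / (B.T @ Xhat ** (beta - 1) + eps)
    return B, G
```

After training, only the bases would be kept; the gains are re-estimated for every new signal.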
3 Proposed extension to multichannel
The previously described NMF-based model is suitable for single-channel data. However, most music recordings are available in a multichannel format, stereo being the most common. To deal with multichannel audio data, an extension of the standard NMF model is required. In the literature, multichannel extensions of NMF have already been considered, either by stacking up the spectrograms of each channel into a single matrix or by equivalently considering nonnegative tensor factorization (NTF) under a parallel factor analysis (PARAFAC) structure, where the channel spectrograms form the slices of a 3-valence tensor [42, 51, 52].
In this paper, we propose an extended multichannel NMF model that is specifically designed for close-microphone music recordings. While recordings of this kind are not usually distributed commercially, many of the raw recordings used in the studio during the mixing process share many similarities with them. The particularities of this scenario define a set of assumptions that are considered in the proposed NMF algorithm:
Problem dimensionality: The proposed method is designed for an over-determined scheme, that is, I≥J where I is the number of channels and J the number of sources.
Single predominant source: For each channel i, there is a single predominant source j ′ that corresponds to a music instrument which is known in advance.
Mixing model: In this work, instantaneous mixing of point sources is considered. Note that the actual mixing process in a close-microphone recording is convolutive. However, since the predominant source is captured with a high direct-to-reverberant ratio, an instantaneous model can be reasonably assumed to simplify the processing. Still, the proposed method can readily be extended to the case of a convolutive mixture, simply by assuming a mixing matrix that varies over frequency.
Input representation: More details about the TF representation are given in the experimental section.
where the left-hand side is the estimate of the complex-valued STFT of channel i; s̲_j(f,t) is the estimate of the complex-valued STFT generated by source j = 1,…,J; J is the number of sources; and the scalar coefficients m_{i,j} define an I×J panning matrix M that measures the multichannel contribution of source j to the data. Note that the mixing coefficients are defined as a function of the kind of spectrogram used (i.e., magnitude or power spectrogram). On the one hand, for magnitude spectrograms, the mixing coefficients are defined as |m_{i,j}|, usually called the mixing matrix. On the other hand, for power spectrograms, the mixing coefficients are defined as |m_{i,j}|^2.
3.1 Panning matrix estimation
The estimation of the panning matrix is performed in three steps. First, an NMF-based automatic transcription method is applied in order to estimate the active notes of the predominant source at each channel. Then, the estimated transcription of the predominant source for each channel is used to discriminate those TF zones in which the sources are present in isolation. Finally, the panning matrix is computed using this information.
3.1.1 Predominant source transcription method
In this step, we describe two NMF-based methods, one for monophonic and the other for polyphonic signals, to estimate the transcription of the predominant source for each channel individually. These methods were previously developed by the authors in  in the context of monaural mixtures. The methods are supervised, requiring fixed basis functions trained using the instrument modeling procedure in Section 2.4. The aim here is to estimate the transcription of the predominant source j′ at each channel i. This information must be known in advance in order to define the proper basis functions.
- 1. Monophonic sources. In the case of monophonic sources, we propose to use the real-time single-pitch constrained method proposed in . In this transcription method, the optimum combination of notes n_opt(i,t) is chosen to minimize the beta-divergence function at channel i and frame t under the assumption that only one gain is nonzero at each frame for channel i, with j′ being the predominant source. Assuming a single predominant source j′ for each channel i, the signal model with the single-combination constraint is defined in Equation 12, where n_opt(i,t) is defined for each channel i as in Equation 13; that is, the spectrum for each channel i at each frame t is approximated by the projection of the predominant source j′ for the optimum note n_opt at frame t. As an advantage, the model of Equation 12 allows the gains to be computed directly from the data and the trained amplitudes without the need for an iterative algorithm.
The beta-divergence at note n and frame t for the predominant source j′ at channel i is given by Equation 14. The value of the gain for channel i, source j′, note n and frame t is obtained by minimizing Equation 14. This minimization has a direct solution (Equation 15), since the value of the gain for note n and frame t is a scalar. Finally, the optimum note at each frame for each channel is selected as the note that minimizes the beta-divergence at each frame and channel (Equation 16), where the proposed solution is valid for β ∈ [0, 2].
To summarize, the monophonic predominant source transcription (MPST) method is detailed in Algorithm 2.
Algorithm 2 Monophonic predominant source transcription (MPST) method
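The monophonic procedure described above can be sketched in NumPy as follows. This is an illustrative reading of the text and not the listing of Algorithm 2: the closed-form gain used here is the one obtained by setting the derivative of the beta-divergence with respect to a scalar gain to zero, which may differ in form from Equation 15, and the sketch is restricted to 1 < β ≤ 2 to avoid the special cases β = 0 and β = 1. The function name `mpst` and the constant `eps` are assumptions.

```python
import numpy as np

def mpst(X, B, beta=1.3, eps=1e-12):
    """Single-note transcription of one channel (hypothetical helper).

    X: (F, T) nonnegative spectrogram of channel i.
    B: (F, N) fixed bases of the predominant source, one column per note.
    Returns the selected note index and its gain for every frame.
    """
    if not 1.0 < beta <= 2.0:
        raise ValueError("this sketch only covers 1 < beta <= 2")
    # Closed-form scalar gain for every (note, frame) pair:
    # g = sum_f x(f,t) b_n(f)^(beta-1) / sum_f b_n(f)^beta
    G = ((B ** (beta - 1)).T @ X) / (np.sum(B ** beta, axis=0)[:, None] + eps)
    N, T = G.shape
    D = np.empty((N, T))
    for n in range(N):
        Xhat = B[:, [n]] * G[[n], :]      # (F, T) rank-one model for note n
        D[n] = np.sum(
            (X ** beta + (beta - 1) * Xhat ** beta - beta * X * Xhat ** (beta - 1))
            / (beta * (beta - 1)),
            axis=0,
        )
    n_opt = np.argmin(D, axis=0)          # note with the lowest divergence per frame
    g_opt = G[n_opt, np.arange(T)]
    return n_opt, g_opt
```

No iteration is involved: one matrix product gives all candidate gains, and the note is chosen by a per-frame minimum.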
- 2. Polyphonic sources. In the case of polyphonic sources, the method presented in  is used to obtain the transcription of the predominant source j′ at each channel i. However, we highlight the fact that any polyphonic estimation procedure may be used at this stage, e.g., the one presented in . In this paper, the applied method is analogous to the classical Euclidean NMF using the gradient descent algorithm, but no iterative process is required, allowing its use for real-time problems. In the case of β=2 (Euclidean distance), Equation 4 for each channel i can be expressed in matrix notation (Equation 17) in terms of the signal input matrix X_i at channel i in time and frequency, the basis matrix of the predominant source j′ at channel i, and the gain matrix of source j′. This factorization can then be examined as a reduced-rank basis decomposition, so that the gains can be estimated in just one step (Equation 18), in which the † operator denotes the Moore-Penrose pseudoinverse.
As commented in , the transcription results with β=2 are poor in comparison with other values of β. Therefore, to improve the performance of the method, a candidate selection stage using the previously explained MPST method will be applied. As a result, the method for polyphonic sources is restricted to have only a few gains that are not zero for each instrument at each frame. Detailed information about the method can be found in . Algorithm 3 describes the computational procedure in the proposed transcription method for polyphonic sources.
Algorithm 3 Polyphonic predominant source transcription (PPST) method
Here, the resulting transcription is composed of binary values, and T is a fixed detection threshold in decibels (dB) that can be either set manually or learnt from training data. Note that in the case of MPST, where only one note is active at a time, a threshold is used to discard notes detected as active during silence intervals.
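The one-step gain estimation and the thresholding can be illustrated with the following NumPy sketch. It is a simplified stand-in for Algorithm 3 under stated assumptions: the candidate selection stage based on MPST is omitted, negative least-squares gains are clipped to zero, and the threshold is applied relative to the largest gain, which is a choice made for this example. The name `ppst` and the default threshold of -30 dB are hypothetical.

```python
import numpy as np

def ppst(X, B, threshold_db=-30.0):
    """One-step polyphonic gain estimate followed by thresholding.

    X: (F, T) nonnegative spectrogram of channel i.
    B: (F, N) fixed bases of the predominant source.
    Returns gains G (N, T) and a boolean transcription of the same shape.
    """
    G = np.linalg.pinv(B) @ X             # least-squares gains, no iterations
    G = np.maximum(G, 0.0)                # keep the estimate nonnegative
    ref = G.max()
    if ref <= 0:
        return G, np.zeros(G.shape, dtype=bool)
    level_db = 20.0 * np.log10(np.maximum(G, 1e-12) / ref)
    return G, level_db > threshold_db
```

With a full-column-rank basis matrix and noise-free data, the pseudoinverse recovers the gains exactly; on real signals the candidate selection stage described in the text is what keeps the estimate sparse.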
3.1.2 Time-frequency zones discrimination
Once the transcription procedure of the predominant source at each channel i has been performed, time-frequency zones in the mixture corresponding only to predominant sources must be discriminated in order to estimate the panning matrix. These zones are assumed to be free from source overlapping; thus, partials from the predominant source are unlikely to be corrupted by notes from the rest of the instruments. In fact, the information resulting from overlapped partials at each frame is considered as corrupted and is not used to estimate the panning matrix.
3.1.3 Energy-based panning matrix estimation
Once the overlapping mask ϑ_j(f,t) is computed for all the channels, the panning matrix is estimated (see Figure 1). The proposed method computes each panning coefficient as the ratio between the norm of each instrument at each channel in its time-frequency region and the norm of that instrument at its predominant channel in the same time-frequency region (the panning coefficient for the instrument at its predominant channel is assumed to be 1). Note that the panning estimation method does not include the phase of the coefficient; in other words, the proposed method estimates the matrix coefficients for magnitude spectrograms. The energy-based panning matrix estimation is detailed in Algorithm 4, where j′ is the predominant source for channel i, ∘ is the Hadamard product and ∥·∥_2 is the 2-norm (Euclidean norm).
Algorithm 4 Panning matrix estimation method
Therefore, Algorithm 4 computes each panning coefficient as the quotient of the contribution of a source to the channel spectrogram and the contribution of that source to its predominant channel.
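A minimal NumPy sketch of this ratio of norms is given below. It follows the description in the text and is not a copy of Algorithm 4: the array layout, the function name `estimate_panning`, and the argument `pred_channel` (the channel in which each source is predominant) are assumptions made for the example.

```python
import numpy as np

def estimate_panning(X, masks, pred_channel, eps=1e-12):
    """Energy-based panning matrix estimate (hypothetical helper).

    X: (I, F, T) magnitude spectrograms of the I channels.
    masks: (J, F, T) binary masks of the non-overlapped zones of each source.
    pred_channel: length-J sequence, pred_channel[j] is the channel where
        source j is predominant.
    Returns M of shape (I, J), with M[pred_channel[j], j] close to 1.
    """
    I, J = X.shape[0], masks.shape[0]
    M = np.zeros((I, J))
    for j in range(J):
        # Norm of source j in its own zone at its predominant channel.
        ref = np.linalg.norm(masks[j] * X[pred_channel[j]])
        for i in range(I):
            # Hadamard product with the mask, then 2-norm over all TF bins.
            M[i, j] = np.linalg.norm(masks[j] * X[i]) / (ref + eps)
    return M
```

When the masked zones are truly free from overlap, each masked channel spectrogram contains only the scaled source, so the ratio returns the panning coefficient directly.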
3.2 Multichannel SSS
Then, classical augmented NMF factorization with MU rules is applied to estimate the gains corresponding to each source j in the multichannel mixture. The process is detailed in Algorithm 5.
Algorithm 5 Multichannel signal gain estimation method
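To make the gain estimation step concrete, the following NumPy sketch estimates the gains of all sources with the bases and the panning matrix held fixed. It is written under assumptions and is not the listing of Algorithm 5: the model is taken to be the sum over sources of m_{i,j} B_j G_j at each channel, the update is the standard beta-divergence multiplicative rule for that model, and the name `multichannel_gains` and the array layout are hypothetical.

```python
import numpy as np

def multichannel_gains(X, B, M, beta=1.3, n_iter=50, seed=0, eps=1e-12):
    """Estimate source gains from multichannel spectrograms.

    X: (I, F, T) nonnegative channel spectrograms.
    B: (J, F, N) fixed instrument bases, one set per source.
    M: (I, J) fixed nonnegative panning matrix.
    Returns G of shape (J, N, T).
    """
    rng = np.random.default_rng(seed)
    J, _, N = B.shape
    T = X.shape[2]
    G = rng.random((J, N, T)) + 1e-3
    for _ in range(n_iter):
        # Model of every channel: sum_j M[i, j] * B[j] @ G[j]
        Xhat = np.einsum("ij,jfn,jnt->ift", M, B, G) + eps
        num = np.einsum("ij,jfn,ift->jnt", M, B, X * Xhat ** (beta - 2))
        den = np.einsum("ij,jfn,ift->jnt", M, B, Xhat ** (beta - 1)) + eps
        G *= num / den                    # only the gains are updated
    return G
```

Since only the gains are free parameters, this is equivalent to an NMF with a fixed dictionary obtained by stacking the scaled bases of all channels.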
3.2.1 Ideal Wiener masks
The source separation consists of estimating the complex amplitude at each time-frequency cell for each source. Some systems use binary separation, which means that the entire energy of a bin is assigned to a single source. However, it has been demonstrated that better results can be obtained with a nonbinary decision, i.e., distributing the energy proportionately over all the sources. The use of separation Wiener masks is common in the source separation literature [37–39]. The Wiener filter method for instantaneous mixing models is described below.
where |s̲_j(f,t)|^2 is called the power spectral density of source j at TF bin (f,t).
Then, the estimated source is computed by the inverse overlap-add STFT of the estimated spectrogram.
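The nonbinary masking idea can be sketched in NumPy as follows. The mask used here, in which each source receives the fraction of the channel power that it contributes under the instantaneous model, is a common form of Wiener filtering and is assumed for illustration; it is not claimed to match the paper's equations term by term. The function name `wiener_separate` and the array layout are hypothetical.

```python
import numpy as np

def wiener_separate(X, S_pow, M, eps=1e-12):
    """Soft-mask separation for an instantaneous mixing model.

    X: (I, F, T) complex mixture STFTs.
    S_pow: (J, F, T) power spectral densities |s_j(f,t)|^2 of the sources.
    M: (I, J) magnitude mixing coefficients.
    Returns masks (I, J, F, T) and complex source images (I, J, F, T).
    """
    # Power contributed by source j to channel i at every TF bin.
    contrib = (M ** 2)[:, :, None, None] * S_pow[None, :, :, :]
    masks = contrib / (contrib.sum(axis=1, keepdims=True) + eps)
    images = masks * X[:, None, :, :]     # the mixture phase is reused
    return masks, images
```

The masks of all sources sum to one at every bin, so the energy of each bin is distributed proportionally rather than assigned to a single source.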
3.2.2 Separated signal decomposition
In the present work, the panning matrix is estimated and used together with the learnt instrument models to perform the separation in an NMF-based framework.
Then, the estimated Wiener mask for each source is applied to the multichannel signal spectrogram at channel i following Equation 24, using the phase information from the original mixture signal of the close microphone near the target instrument. In this way, the estimated predominant source spectrogram is obtained, and the estimated predominant source is computed by applying the inverse overlap-add STFT to it, using the phase information from x̲_i(f,t), where j′ is the predominant source at channel i.
4.1 Training and test data
At the training stage (see Section 2.4), the basis functions are estimated using the RWC musical instrument sound database [48, 49] and the full pitch range for each instrument. Four instruments are studied in the experiments (violin, clarinet, tenor saxophone, and bassoon). Individual sounds are available with a semitone frequency resolution over the entire range of notes for each instrument. Files from the RWC database have different playing styles. Files with a normal playing style and mezzo dynamic level are selected as in the literature. Training with different playing styles leads to different models. However, as demonstrated in , the selected configuration (normal playing style and mezzo dynamic level) is representative of the different models.
The database proposed in  is used for the testing stage. This database consists of ten J.S. Bach four-part chorales with the corresponding aligned MIDI data. The audio files are approximately 30 s long and are sampled at 44.1 kHz from the real performances. Each music excerpt consists of an instrumental quartet (violin, clarinet, tenor saxophone, and bassoon), and each instrument is given in an isolated track.
4.2 Experimental setup
4.2.1 Time-frequency representation
Many NMF-based signal processing applications adopt a logarithmic frequency discretization. For example, uniformly spaced subbands on the equivalent rectangular bandwidth (ERB) scale are assumed in [16, 17]. In this work, two time-frequency resolutions are used. First, to estimate the instrument models and the panning matrix, a single semitone resolution is used as in . In fact, the training database and the ground-truth score information are composed of notes that are separated by one semitone in frequency. This representation has proven to obtain accurate results for music transcription, which is the key point when estimating the panning matrix. Second, for the separation task, a higher resolution of 1/4 of a semitone is used as in , which has proven to achieve better separation results. These time-frequency representations are obtained by integrating the STFT bins corresponding to the same semitone, or 1/4 semitone, interval. Note that in the separation stage, the learnt basis functions b_{j,n}(f) are adapted to the 1/4 semitone resolution by replicating the basis value at each semitone four times, once for each of the four 1/4-semitone bins that belong to this semitone. The frame size and the hop size for the STFT are set to 128 and 32 ms, respectively.
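The band integration and the replication of the bases can be sketched in NumPy as follows. Several details are assumptions made for the example and are not stated in the text: the bins are summed (rather than averaged), each semitone band is centred on its MIDI pitch with A4 = 440 Hz at MIDI note 69, and the names `semitone_bands` and `replicate_bases` are hypothetical.

```python
import numpy as np

def semitone_bands(spec, sr, n_fft, midi_lo, midi_hi, bins_per_semitone=1):
    """Integrate linear-frequency STFT bins into (fractions of) semitone bands.

    spec: (n_fft // 2 + 1, T) nonnegative spectrogram.
    Returns an array of shape ((midi_hi - midi_lo + 1) * bins_per_semitone, T).
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    n_bands = (midi_hi - midi_lo + 1) * bins_per_semitone
    out = np.zeros((n_bands, spec.shape[1]))
    valid = freqs > 0                      # the DC bin has no pitch
    midi = 69.0 + 12.0 * np.log2(freqs[valid] / 440.0)
    # The first band starts half a semitone below midi_lo.
    idx = np.floor((midi - midi_lo + 0.5) * bins_per_semitone).astype(int)
    keep = (idx >= 0) & (idx < n_bands)
    np.add.at(out, idx[keep], spec[valid][keep])   # sum bins falling in a band
    return out

def replicate_bases(B, factor=4):
    """Copy each semitone row `factor` times to reach the finer resolution."""
    return np.repeat(B, factor, axis=0)
```

Calling `semitone_bands` with `bins_per_semitone=1` gives the representation used for training and transcription, and `bins_per_semitone=4` gives the one used for separation; `replicate_bases` maps bases learnt at the first resolution onto the second.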
4.2.2 Initialization of model parameters
Basis functions (N=88), ranging from MIDI note 20 to 108
Partials per basis function for the harmonic constraint models (M=20)
Iterations for the NMF-based algorithms (50)
4.2.3 Audio separation: method and metrics
For an objective evaluation of the performance of the separation method, the Perceptual Evaluation methods for Audio Source Separation (PEASS) toolkit  has been used. The use of objective measures based on energy ratios between the signal components, i.e., source to distortion ratio (SDR), the source to interference ratio (SIR), the source to artifacts ratio (SAR) and the source image to spatial distortion ratio (ISR), has been the standard approach in the specialized scientific community to test the quality of extracted signals.
Moreover, the overall perceptual score (OPS), the target-related perceptual score (TPS), the interference-related perceptual score (IPS) and the artifacts-related perceptual score (APS) objective measures have been used with the aim of predicting a set of subjective scores. The approach to compute the objective measures makes use of auditory-motivated metrics provided by the PEMO-Q auditory model to assess the perceptual salience of the target distortion (qTarget), interference (qInterf) and artifacts (qArtif), computing also a global metric (qGlobal). Then, a nonlinear mapping using neural networks trained with a large set of different audio signals is performed in order to get the set of objective measures. Further information about these metrics can be found in .
The proposed separation approach, shown in Figure 1, is compared with some state-of-the-art methods and some unrealistic situations in order to evaluate its separation capabilities. The different approaches compared here are the following:
Default: It refers to the actual separation performed by the simulation setup presented in Figure 5. Since the sensors have cardioid directivity characteristics and they are placed following a close-miking setup, the instrument close to each microphone is predominant in the corresponding mixture channel.
Ideal separation: This method serves as an upper bound for the best separation that can be achieved with the time-frequency representation used. The optimal value of the Wiener mask at each frequency and time component is computed assuming that the signals to be separated are known in advance.
Oracle IMM: This approach evaluates the limitations of using an instantaneous mixing model. The separation scheme is similar to the one presented in Figure 5 but the panning matrix is estimated by knowing in advance the signals to be separated. The separation approach is identical to the proposed one but the panning matrix is optimal. We have included this method in order to evaluate the influence of the proposed panning matrix estimation stage on the final performance.
Kokkinis method: We have included in the evaluation the results of the method proposed by Kokkinis et al. in  as the state-of-the-art method for addressing the microphone leakage problem.
As can be seen, the best results are obtained with the ideal separation method, whose average SDR of about 21 dB indicates the best separation that can be achieved with the time-frequency representation used (1/4 semitone resolution in frequency). The default method is limited by the microphone leakage and obtains an average SDR of 12 dB. The oracle IMM approach provides information about the best separation results that can be obtained using our approach with an optimal panning matrix. In fact, the proposed method performs similarly to the oracle solution for the studied signals; therefore, the transcription procedure in the mixing matrix estimation stage provides an accurate estimate of the ground-truth data.
Finally, Kokkinis' method is on average about 1 dB below the proposed method in SDR. These last results suggest that while Wiener filtering might be a simple and powerful approach to solve the microphone leakage problem, there is room for further improvements by following an informed approach such as the one presented in this paper.
Table 1 Multichannel source separation results measured using the PEASS toolkit
More detailed information about the separation metrics is presented in Table 1. In this table, the different metrics are given per instrument on average for the ten excerpts in the test database. In relation to the classical separation metrics (i.e., SDR, SAR, SIR and ISR, in decibels), the default method is limited by the interferences between the different instruments (SIR) while the other methods achieve better SIR results. Moreover, the bassoon separation performs worse in general, in comparison with the other instruments. This fact has been observed in other studies made by the authors [43, 53]. Actually, for the bassoon case, the amplitude variations over the time line of the note cause a mismatch with the window transform in certain frequency locations (i.e., blurred regions in the spectrogram).
Similar conclusions can be extracted by analyzing the results obtained with the perceptual similarity measures (PSM) provided by the PEMO-Q auditory model. However, with these metrics the differences between the instruments vary as a function of the analyzed signals. The Kokkinis’ approach obtains better results than the proposed model (Direct β 1.3) and the oracle IMM in terms of qGlobal (PSM values are within 0 and 1, being the unity the best result). This discrepancy between perceptual metrics (qGlobal) and classical separation metrics (SDR) can be justified after listening to the separated excerpts. In our opinion, Kokkinis’ approach offers better separation capabilities at high frequencies while at low frequencies the amount of distortion seems higher. In contrast, the proposed approach seems to have higher distortion at high frequencies. Probably, the use of a softer mask in Kokkinis’ approach might be a reason for these differences. It must be stressed that the optimum β value has been selected in this work in terms of SDR. Further work could be done to explore the performance of this method when the β value is selected to be optimal in terms of the perceptual metrics.
Finally, the measures proposed in PEASS (i.e., OPS, TPS, IPS, and APS) are strongly correlated with PSM. In general, the different approaches are ranked in the same order, from best to worst, in terms of OPS as in terms of qGlobal (PEASS metrics lie in the interval from 0 to 100, with 100 being the best result). The only exception is the default approach for bassoon and clarinet, where the OPS shows higher values, in contrast with the lower values of qGlobal. However, the excerpts separated with the default approach for bassoon and clarinet clearly have lower perceptual quality than those of the other approaches. The nonlinear mapping provided by the PEASS neural network therefore does not offer satisfactory results in these cases.
In this paper, an informed NMF-based SSS method has been proposed to tackle the microphone leakage problem in multichannel close-microphone recordings. The proposed method is specifically designed for a scenario in which the number of source signals is equal to or less than the number of microphone signals and a single predominant source is considered for each mixture signal. As shown in the evaluation, despite assuming instantaneous mixing and using fixed instrument models, the proposed method performs similarly to other state-of-the-art approaches, showing the potential of NMF-based approaches in real-world applications. Moreover, the use of trained instrument models allows for fast computation of the panning matrix and simplifies the separation stage by reducing the factorization to the estimation of the instruments' time-varying gains. However, because these models are fixed, differences with respect to the spectra of the instruments actually present in the mixture may lead to worse separation results, as seen in the case of the bassoon. Further work will be aimed at adapting the parameters of the model to the observed music scene. To this end, a proper initialization of the gains and the use of additional optimization constraints will be considered, so that the parameters are only adapted when there is high confidence that a note is active and free of interference.
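As an illustration of how fixed instrument models reduce the factorization to gain estimation, the following minimal Python sketch updates only the gains using the standard multiplicative rule for the β-divergence. It is not the authors' implementation: the function names `estimate_gains` and `beta_divergence`, the random templates in `W`, and the single-channel setting are assumptions made for this example. Only the value β = 1.3 is taken from the evaluation above.

```python
import numpy as np


def beta_divergence(X, V, beta=1.3):
    """Sum of element-wise beta-divergences d(X | V); valid for beta not in {0, 1}."""
    return np.sum(
        (X ** beta + (beta - 1) * V ** beta - beta * X * V ** (beta - 1))
        / (beta * (beta - 1))
    )


def estimate_gains(X, W, beta=1.3, n_iter=100, eps=1e-12, seed=0):
    """Estimate nonnegative gains H such that W @ H approximates X, with W held fixed.

    X: (F, T) nonnegative magnitude spectrogram of one mixture channel.
    W: (F, K) fixed spectral templates (stand-in for pre-trained instrument models).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps  # random nonnegative start
    for _ in range(n_iter):
        V = W @ H + eps  # current model of the spectrogram
        # Multiplicative update for the beta-divergence; W is never modified.
        H *= (W.T @ (X * V ** (beta - 2))) / (W.T @ V ** (beta - 1) + eps)
    return H


# Toy usage with synthetic data (assumed sizes: 64 bins, 5 templates, 40 frames).
rng = np.random.default_rng(1)
W = rng.random((64, 5)) + 0.01
X = W @ rng.random((5, 40))
H = estimate_gains(X, W)
print(beta_divergence(X, W @ H))
```

Because only `H` is updated, each iteration costs two matrix products with the fixed templates, which is consistent with the simplification described above. Adapting `W` to the observed scene, as proposed for future work, would require an additional update that this sketch deliberately omits.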
This work was supported by the Andalusian Business, Science and Innovation Council under project P2010-TIC-6762 (FEDER), and by the Spanish Ministry of Economy and Competitiveness under projects TEC2012-38142-C04-03 and TEC2012-37945-C02-02. The authors would like to thank the anonymous reviewers, whose comments greatly helped to improve the original manuscript, as well as Z. Duan for kindly sharing his annotated real-world music database.
- Huber DM, Runstein RE: Modern Recording Techniques. Focal Press, UK; 2009.
- Clifford A, Reiss JD: Microphone interference reduction in live sound. In Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11). Paris; 19–23 September 2011.
- Kokkinis EK, Mourjopoulos J: Unmixing acoustic sources in real reverberant environments for close-microphone applications. J. Audio Eng. Soc 2010, 58(11):907-922.
- Kokkinis EK, Reiss JD, Mourjopoulos J: A Wiener filter approach to microphone leakage reduction in close-microphone applications. IEEE Trans. Audio, Speech, Language Process 2012, 20(3):767-779.
- Comon P, Jutten C: Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, Oxford; 2010.
- Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401: 788-791.
- Yilmaz O, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52(7):1830-1847. 10.1109/TSP.2004.828896
- Cobos M, Lopez JJ: Maximum a posteriori binary mask estimation for underdetermined source separation using smoothed posteriors. IEEE Trans. Audio, Speech, Language Process 2012, 20(7):2059-2012.
- Pedersen MS, Larsen J, Kjems U, Parra LC: Convolutive blind source separation methods. In Springer Handbook of Speech Processing. Edited by: Benesty J, Sondhi MM, Huang Y. Springer, Berlin; 2008:1065-1084.
- Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22(1):21-34.
- Virtanen T: Sound source separation in monaural music signals. Thesis, Tampere University of Technology; 2006.
- Smaragdis P, Brown J: Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). New Paltz; 19–22 October 2003.
- Cobos M, Vera-Candeas P, Carabias-Orti JJ, Ruiz-Reyes N, Lopez JJ: Blind estimation of reverberation time from monophonic instrument recordings based on non-negative matrix factorization. In Proceedings of the AES 42nd International Conference: Semantic Audio. Ilmenau; 22–24 July 2011.
- Ozerov A, Févotte C: Multichannel non-negative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio, Speech, Language Process 2010, 18(3):550-563.
- Hennequin R, Badeau R, David B: Time-dependent parametric and harmonic templates in non-negative matrix factorization. In Proceedings of the International Conference on Digital Audio Effects (DAFx). Graz; 6–10 September 2010:246-253.
- Vincent E, Bertin N, Badeau R: Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech, Language Process 2010, 18(3):528-537.
- Bertin N, Badeau R, Vincent E: Enforcing harmonicity and smoothness in bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Trans. Audio, Speech, Language Process 2010, 18(3):538-549.
- Itoyama K, Goto M, Komatani K, Ogata T, Okuno HG: Instrument equalizer for query-by-example retrieval: improving sound source separation based on integrated harmonic and inharmonic models. In Proceedings of the International Conference for Music Information Retrieval (ISMIR). Philadelphia; 14–18 September 2008:133-138.
- Wu J, Vincent E, Raczynski SA, Nishimoto T, Ono N, Sagayama S: Multipitch estimation by joint modeling of harmonic and transient sounds. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Prague; 22–27 May 2011:25-28.
- Heittola T, Klapuri A, Virtanen T: Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Kobe; 26–30 October 2009:327-332.
- Durrieu JL, David B, Richard G: A musically motivated mid-level representation for pitch estimation and musical audio source separation. IEEE J. Selected Topics Signal Process 2011, 5(6):1180-1191.
- Carabias-Orti JJ, Virtanen T, Vera-Candeas P, Ruiz-Reyes N, Cañadas-Quesada FJ: Musical instrument sound multi-excitation model for non-negative spectrogram factorization. IEEE J. Selected Topics Signal Process 2011, 5(6):1144-1158.
- Ozerov A, Liutkus A, Badeau R, Richard G: Informed source separation: source coding meets source separation. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’11). New Paltz; 16–19 October 2011.
- Bosch JJ, Kondo K, Marxer R, Janer J: Score-informed and timbre independent lead instrument separation in real-world scenarios. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2417-2421.
- Hennequin R, David B, Badeau R: Score informed audio source separation using a parametric model of non-negative spectrogram. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Prague; 22–27 May 2011:45-48.
- Ganseman J, Scheunders P, Mysore G, Abel J: Evaluation of a score-informed source separation system. In 11th International Society for Music Information Retrieval Conference (ISMIR 2010). Utrecht; 9–13 August 2010.
- Duan Z, Pardo B: Soundprism: an online system for score-informed source separation of music audio. IEEE J. Selected Topics Signal Process 2011, 5(6):1205-1215.
- Simsekli U, Cemgil AT: Score guided musical source separation using generalized coupled tensor factorization. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2639-2643.
- Rodriguez-Serrano FJ, Carabias-Orti JJ, Vera-Candeas P, Canadas-Quesada FJ, Ruiz-Reyes N: Monophonic constrained non-negative sparse coding using instrument models for audio separation and transcription of monophonic source-based polyphonic mixtures. Multimedia Tools Appl 2013. Available at http://link.springer.com/article/10.1007%2Fs11042-013-1398-8
- Fuentes B, Badeau R, Richard G: Blind harmonic adaptive decomposition applied to supervised source separation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2654-2658.
- Hurmalainen A, Gemmeke J, Virtanen T: Detection, separation and recognition of speech from continuous signals using spectral factorisation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2649-2653.
- Fitzgerald D: User assisted separation using tensor factorisations. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest; 27–31 August 2012:2412-2416.
- Liutkus A, Pinel J, Badeau R, Girin L, Richard G: Informed source separation through spectrogram coding and data embedding. Signal Process 2012, 92(8):1937-1949. 10.1016/j.sigpro.2011.09.016
- Casey M, Westner A: Separation of mixed audio sources by independent subspace analysis. In Proceedings of the International Computer Music Conference (ICMC ’00). Berlin; September 2000:154-161.
- Virtanen T: Sound source separation using sparse coding with temporal continuity objective. In Proceedings of the International Computer Music Conference (ICMC ’03). Singapore; September 2003.
- Virtanen T: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech, Language Process 2007, 15(3):1066-1074.
- Benaroya L, Bimbot F, Gribonval R: Audio source separation with a single sensor. IEEE Trans. Audio, Speech, Language Process 2006, 14(1):191-199.
- Cemgil AT, Peeling P, Dikmen O, Godsill S: Prior structures for time-frequency energy distributions. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz; 21–24 October 2007:151-154.
- Liutkus A, Badeau R, Richard G: Gaussian processes for underdetermined source separation. IEEE Trans. Signal Process 2011, 59(7):3155-3167.
- Virtanen T, Klapuri A: Analysis of polyphonic audio using source-filter model and non-negative matrix factorization. In Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop. Whistler; 9 December 2006.
- Raczyński SA, Ono N, Sagayama S: Multipitch analysis with harmonic nonnegative matrix approximation. In Proceedings of the International Conference on Music Information Retrieval (ISMIR). Vienna; 23–27 September 2007:381-386.
- FitzGerald D, Cranitch M, Coyle E: Extended nonnegative tensor factorisation models for musical source separation. Comput. Intell. Neurosci 2008, 2008: 15. Article ID 872425.
- Carabias-Orti JJ, Rodriguez-Serrano FJ, Vera-Candeas P, Canadas-Quesada FJ, Ruiz-Reyes N: Constrained non-negative sparse coding using learnt instrument templates for realtime music transcription. Eng. Appl. Artif. Intell 2013, 26(7):1671-1680. 10.1016/j.engappai.2013.03.010
- Ozerov A, Vincent E, Bimbot F: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio, Speech, Language Process 2012, 20(4):1118-1133.
- Févotte C, Bertin N, Durrieu JL: Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Comput 2009, 21(3):793-830.
- Févotte C, Idier J: Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Comput 2011, 23(9):2421-2456.
- Lee DD, Seung HS: Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 2001, 13: 556-562.
- Goto M, Hashiguchi H, Nishimura T, Oka R: RWC music database: popular, classical, and jazz music databases. In Proceedings of the 3rd International Society for Music Information Retrieval Conference (ISMIR). Paris; 13–17 October 2002.
- Goto M: Development of the RWC music database. In Proceedings of the 18th International Congress on Acoustics (ICA 2004). Kyoto; 4–9 April 2004:I-553-556. (invited paper)
- Parry RM, Essa IA: Estimating the spatial position of spectral components in audio. In Proceedings of the 6th International Conference of Independent Component Analysis and Blind Signal Separation (ICA’06). Charleston; 5–8 March 2006:666-673.
- FitzGerald D, Cranitch M, Coyle E: Non-negative tensor factorisation for sound source separation. In Proceedings of the Irish Signals and Systems Conference. Dublin; September 2005:8-12.
- Fevotte C, Ozerov A: Notes on nonnegative tensor factorization of the spectrogram for audio source separation: statistical insights and towards self-clustering of the spatial cues. In Proceedings of the 7th International Symposium on Computer Music Modeling and Retrieval (CMMR). Malaga; 21–24 June 2010:102-115.
- Carabias-Orti JJ, Virtanen T, Vera-Candeas P, Ruiz-Reyes N, Canadas-Quesada FJ: Musical instrument sound multi-excitation model for non-negative spectrogram factorization. IEEE J. Selected Topics Signal Process 2011, 5(6):1144-1158.
- Campbell DR, Palomaki KJ, Brown GJ: A MATLAB simulation of “shoebox” room acoustics for use in research and teaching. Comput. Inf. Syst. J 2005, 9(3):48-51.
- Emiya V, Vincent E, Harlander N, Hohmann V: Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio, Speech, Language Process 2011, 19(7):2046-2057.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.