Blind separation of overlapping partials in harmonic musical notes using amplitude and phase reconstruction
 Jesús Ponce de León^{1} and
 José Ramón Beltrán^{1}
https://doi.org/10.1186/168761802012223
© Ponce de León and Beltrán; licensee Springer. 2012
Received: 20 May 2011
Accepted: 2 July 2012
Published: 16 October 2012
Abstract
In this study, a new method of blind audio source separation (BASS) of monaural musical harmonic notes is presented. The input (mixed notes) signal is processed using a flexible analysis and synthesis algorithm (complex wavelet additive synthesis, CWAS), which is based on the complex continuous wavelet transform. When the harmonics from two or more sources overlap in a certain frequency band (or group of bands), a new technique based on amplitude similarity criteria is used to obtain an approximation to the original partial information. The aim is to show that the CWAS algorithm can be a powerful tool in BASS. Compared with other existing techniques, the main advantages of the proposed algorithm are its accuracy in the instantaneous phase estimation, its synthesis capability, and the fact that the only input information needed is the mixed signal itself. A set of synthetically mixed monaural isolated notes has been analyzed using this method, in eight different experiments: the same instrument playing two notes within the same octave and two harmonically related notes (5th and 12th intervals), two different musical instruments playing 5th and 12th intervals, two different instruments playing nonharmonic notes, major and minor chords played by the same musical instrument, three different instruments playing nonharmonically related notes, and finally the mixture of an inharmonic instrument (piano) and one harmonic instrument. The results obtained show the strength of the technique.
Introduction
Blind audio source separation (BASS) has been receiving increasing attention in recent years. The BASS techniques try to recover source signals from a mixture, when the mixing process is unknown. Blind means that very little information is needed to carry out the separation, although it is in fact absolutely necessary to make assumptions about the statistical nature of the sources or the mixing process itself.
In other applications, when a monaural solution is needed (i.e., when M = 1), the mathematical indetermination of the mixture significantly increases the difficulties of the task. Hence, monaural separation is probably the most difficult challenge for BASS, but even in this case, the human auditory system itself can somehow segregate the acoustic signal into separate streams [6]. Several techniques for solving the BASS problem in general (and the monaural separation in particular) have been developed.
Psychoacoustic studies, such as computational auditory scene analysis [7, 8], inspired by auditory scene analysis [6], attempt to explain the aforementioned capability of the human auditory system for selective attention. Psychoacoustics also suggests that temporal and spectral coherence between sources can be used to discriminate between them [9]. Within the statistical techniques, independent component analysis (ICA) [10, 11] assumes statistical independence among sources, while independent subspace analysis [12] extends ICA to single-channel source separation. Sparse decomposition [13] assumes that a source is a weighted sum of bases from an overcomplete set, considering that most of these bases are inactive most of the time [14], that is, their relative weights are presumed to be mostly zero. Nonnegative matrix factorization [15, 16] attempts to find a mixing matrix (with sparse weights [17, 18]) and a source matrix with nonnegative elements so that the reconstruction error is minimized.
Finally, sinusoidal modeling techniques assume that every sound is a linear combination of sinusoids (partials) with time-varying frequencies, amplitudes, and phases. Therefore, sound separation requires a reliable estimation of these parameters for each source present in the mixture [19–21], or some a priori knowledge, e.g., rough pitch estimates of each source [22, 23]. One of the most important applications is monaural speech enhancement and separation [24]. These techniques are generally based on some analysis of speech or interference and subsequent speech amplification or noise reduction. Most authors have used the STFT to analyze the mixed signal in order to obtain its main sinusoidal components or partials. Auditory-based representations [25] can also be used.
One of the most important and difficult problems to solve in the separation of pitched musical sounds is overlapping harmonics, that is, when frequencies of two harmonics are approximately the same. The problem of overlapping harmonics has been studied during the past decades [26], but it is only in recent years that there has been a significant increase in research on this topic. Given that the information in overlapped regions is unreliable, several recent systems have attempted to utilize the information from neighboring nonoverlapped harmonics. Some systems assume that the spectral envelope of the instrument sounds is smooth [27–29]; hence, the amplitude of an overlapped harmonic can be estimated from the amplitudes of nonoverlapped harmonics from the same source, via weighted sum [20], or interpolation [21, 27]. The spectral smoothness approximation is often violated in real instrument recordings. A different approximation is known as the common amplitude modulation (CAM) [22], which assumes that the amplitude envelopes of different harmonics from the same source tend to be similar. The authors of [30] propose an alternate technique for harmonic envelope estimation, called harmonic temporal envelope similarity (HTES). They use the information from the nonoverlapped harmonics of notes of a given instrument, wherever they occur in a recording, to create a model of the instrument which can be used to reconstruct the harmonic envelopes for overlapped harmonics, allowing separation of completely overlapped notes. Another option is the average harmonic structure (AHS) model [31] which, given the number of sources, creates a harmonic structure model for each present source, using these models to separate notes showing overlapping harmonics.
In this study, we use an experimentally less restrictive version of the CAM assumption within a sinusoidal model generated using a complex bandpass filtering of the signal. Nonoverlapping harmonics are obtained using a binary masking approach derived from the complex wavelet additive synthesis (CWAS) algorithm [32], which is based on the complex continuous wavelet transform (CCWT). The main advantage of the proposed technique is the synthesis capability of the CWAS algorithm. Using the CWAS wavelet coefficients, it is possible to synthesize an output signal which differs negligibly (numerically and acoustically) from the original input signal. Hence, the nonoverlapped partials can be obtained accurately. The separated amplitudes of overlapping harmonics are reconstructed proportionally from the nonoverlapping harmonics, following energy criteria in a least-squares framework. This way, it is possible to relax the phase restrictions, and the instantaneous phase for each overlapping source can also be constructed from the phase of nonoverlapping partials. At its current stage, the proposed technique can be used to separate two or more musical instruments, each one playing a single note.
The rest of the article is organized as follows. The “Complex bandpass filtering” section provides a brief introduction to the CCWT and the CWAS algorithms, including the interpretation of their results and the additive synthesis process. The proposed separation algorithm and its main blocks (such as the fundamental frequency estimation) are presented in the “Separation algorithm” section, with a detailed example. The numerical results of the different experiments and tests are shown in the “Experimental results” section. Finally, the main conclusions and current and future lines of work are presented in the “Conclusions” section.
Complex bandpass filtering
The CCWT
where C is a normalization constant which can be calculated independently of the input signal in order to conserve the energy of the transform [34].
This result can locally be applied to every detected partial of the analyzed signal, providing a model of the audio signal close to its canonical pair. The output (synthetic) signal is the real part of ρ(t) (the real part of the additive synthesis of the detected partials in the general case). This synthetic signal remains very close to the original input signal x(t) in numerical and acoustical terms [32].
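The additive synthesis described above can be illustrated with a minimal sketch. It assumes that each detected partial is available as a sampled instantaneous amplitude A_i(t) and instantaneous phase ϕ_i(t); the function name `additive_synthesis` and the test tones are hypothetical and are not part of the CWAS implementation in [32].

```python
import numpy as np

def additive_synthesis(amplitudes, phases):
    """Sum the complex partials A_i(t) * exp(j * phi_i(t)) and keep the real part."""
    amplitudes = np.asarray(amplitudes, dtype=float)   # shape (n_partials, n_samples)
    phases = np.asarray(phases, dtype=float)           # same shape, in radians
    complex_sum = np.sum(amplitudes * np.exp(1j * phases), axis=0)
    return complex_sum.real

# Illustrative input: two stationary partials at 220 Hz and 440 Hz.
fs = 22050
t = np.arange(fs // 10) / fs
amps = np.stack([np.full_like(t, 1.0), np.full_like(t, 0.5)])
phis = np.stack([2 * np.pi * 220.0 * t, 2 * np.pi * 440.0 * t])
x_syn = additive_synthesis(amps, phis)
```

For these stationary partials the result is simply the sum of two cosines; with time-varying amplitudes and phases the same sum yields the resynthesized signal.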
The CWAS algorithm
In the CWAS algorithm [32], a complex mother wavelet allows us to analyze the complex coefficients of Equation (1), stored in a matrix (the CWT matrix), in modulus and phase, directly obtaining the instantaneous amplitude and the instantaneous phase of each detected component [34, 35]. A single parameter, the number of divisions per octave D (a vector with as many dimensions as octaves present in the signal’s spectrum), controls the frequency resolution of the analysis.
where ${W}_{x}({a}_{{m}_{i}},t)$ are the wavelet coefficients W_{ x }(a_{ m },t) associated with the i th peak (partial).
where t_{ m }is the m th sample of the temporal duration of the partial i (whose length is l_{ i }, in samples). Obviously, E_{ i } is a measure of the energy of the partial.
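As a rough illustration of how these quantities are read from a complex partial, the sketch below takes a sampled complex function P_i(t) and returns its modulus, its unwrapped phase, and an energy value. The energy is assumed here to be the sum of the squared modulus over the l_i samples of the partial; that definition, and the name `partial_descriptors`, are assumptions made for illustration and may differ from the exact form of Equation (8).

```python
import numpy as np

def partial_descriptors(partial):
    """Return (instantaneous amplitude, unwrapped phase, energy) of a complex partial."""
    partial = np.asarray(partial, dtype=complex)
    amplitude = np.abs(partial)                 # instantaneous amplitude A_i(t)
    phase = np.unwrap(np.angle(partial))        # instantaneous phase, unwrapped (radians)
    energy = float(np.sum(amplitude ** 2))      # assumed energy measure E_i
    return amplitude, phase, energy

# Illustrative partial: constant amplitude 0.8, 440 Hz, initial phase 0.3 rad.
fs = 22050
t = np.arange(2205) / fs
P = 0.8 * np.exp(1j * (2 * np.pi * 440.0 * t + 0.3))
A, phi, E = partial_descriptors(P)
```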
The objective of this study is to be able to use this information to somehow separate a signal composed of two or more mixed notes into the original isolated sources. The only input of the system is the mixed signal (no additional data is needed).
BASS
As stated above, BASS attempts to obtain the original isolated sources s_{ k }(t) present in a certain signal x(t), when the mixture process is unknown.
As we do not know a priori the number N of sources present in x(t), the first problem is to divide the detected partials into as many different families or categories as sources, having a minimum error between members of a class [19]. A first approximation to the BASS task using the CWAS technique was performed and presented in [37]. There, we used an onset detection algorithm [38] to find a rough division of the partials, grouping them into the different sources. The main advantage of using the CWAS algorithm instead of the STFT is its proven ability of high-quality resynthesis. As explained, the time and frequency errors in the synthesis of signals using the CWAS algorithm are remarkably small, and the acoustical differences between the original and synthetic signals are negligible for most people [32]. This high-fidelity synthesis makes the CWAS algorithm a very useful tool for source separation.
In the general case, when there are two or more audio sources present in the analyzed signal, a certain partial can be part of one of the sources, it can be shared by two or more sources, or it can be part of none of them (i.e., inharmonic or noisy partials). The algorithm will search for any fundamental frequency present in the mixed signal, and each f_{0} will be considered as an indicator of the presence of a source (see “Multiple f_{ 0 } estimation” section). A harmonic analysis will find the set of partials which belongs to each source, and the set of overlapping partials (and which sources are overlapping in each case). Then, the information of the isolated partials will be used to reconstruct an estimate of the contribution from each source to every overlapping partial, and the separated sources will be generated by additive synthesis (see “The separation process” section). This idea was used in [22], but in this study the only input information is the mixed signal (we do not need the estimated pitch, because the f_{0} estimator gives us this information). The quality of the separation (see “Quality separation measurement” section) will be measured using the standards proposed in [39].
Separation algorithm
The waveform, modulus of the CWT matrix, and scalogram of this signal can be seen in Figure 2. The numerical quality separation measurement of this signal can be seen in the following section. In the example, we will concentrate on a single overlapping partial. The isolated original partials will also be used to test the robustness of the method.
The main steps of the separation algorithm are summarized below.

From x(t)→P_{ i }(t)→A_{ i }(t), Φ_{ i }(t) (CWAS).

From Φ_{ i }(t), through Equation (7) →f_{ i }(t).

Estimation of f_{0k} and their harmonic partials ∀k.

Separation of overlapping partials.

Additive synthesis →s_{ k }(t).
It is important to remark that, at its current stage, the separation process is performed using the information of the whole signal.
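The second step of the list above, going from the instantaneous phase Φ_i(t) to the instantaneous frequency f_i(t), can be sketched as follows. The sketch assumes that Equation (7) is the usual definition of instantaneous frequency as the time derivative of the phase divided by 2π; the function name and the chirp used as input are hypothetical.

```python
import numpy as np

def instantaneous_frequency(phase, fs):
    """Instantaneous frequency in Hz from an unwrapped phase sampled at fs Hz."""
    phase = np.asarray(phase, dtype=float)
    # Finite-difference derivative of the phase (rad/sample), converted to Hz.
    return np.gradient(phase) * fs / (2.0 * np.pi)

# Illustrative partial: linear chirp whose frequency is f(t) = 440 + 100 t Hz.
fs = 22050
t = np.arange(2205) / fs
phi = 2 * np.pi * (440.0 * t + 50.0 * t ** 2)
f_inst = instantaneous_frequency(phi, fs)
f_mean = float(np.mean(f_inst))   # average frequency of the partial
```

The average value `f_mean` plays the role of the mean partial frequency used later in the harmonic analysis.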
Multiple f_{ 0 }estimation
In this study, we have considered that a musical instrument cannot play more than one note simultaneously (i.e., we work mainly with monophonic instruments). If an instrument plays two or more notes simultaneously (polyphony), the developed algorithm will consider that each note comes from a different source. With such an approximation, the fundamental frequencies present, f_{0j}, j = 1,…,N, become the natural parameter used to calculate the number of sources present in the mixture, and the reliability of the f_{0} estimator becomes critically important.
The input (mixed) signal is analyzed using the CWAS algorithm, which provides as results the n complex functions that define the temporal evolution of each detected partial. Using Equations (7) and (8), the instantaneous frequencies for each partial (and their respective average values, $\overline{{f}_{j}}$∀j = 1,…,n) and the energy distribution of the signal are obtained. This information is equivalent to the scalogram of the signal clustered around the set of detected partials. Only the partials with energy greater than the threshold E_{ th }= 1% will be considered in the search of the harmonic sets associated with each source. From the remaining energy distribution, the most energetic partial (MIP in Figure 4) is selected, and the harmonic analysis is computed next.
where N_{ k } is the largest natural number satisfying N_{ k }f_{0k} ≤ f_{ s }/2, f_{ s } being the sampling rate.
where θ_{ a } is the inharmonicity threshold. Taking θ_{ a }= 0.03, the partials of an inharmonic instrument like the piano are correctly analyzed.
where n_{a,k} is the total number of harmonics associated with f_{0k} and n_{ip,k} is the number of partials with energy above the threshold E_{ th }. E_{i,k} is the energy of the i th partial associated with f_{0k}.
Substituting these new energy values into the corresponding partials of the original energy distribution, a new MIP can be obtained. The process is iterated until the energy of the distribution falls below a threshold or the maximum number of sources (MNS in Figure 4) has been reached. In this study, we have limited the number of sources to MNS = 5. Using this technique, it is possible to obtain the fundamental frequencies even in the most difficult cases, for example when a fundamental frequency overlaps with a harmonic of another source or in the case of suppressed fundamentals. Overlapping fundamentals can also be detected using this technique.
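The harmonic analysis around a candidate fundamental can be sketched as below. The matching rule is an assumption: a partial with mean frequency f_j is associated with the n th harmonic when its relative deviation |f_j − n f_0| / (n f_0) does not exceed the inharmonicity threshold (0.03 here), with n limited by the Nyquist frequency as stated above. The exact inequality used in the article is not reproduced in this text, and the function name and the frequency list are hypothetical.

```python
import numpy as np

def harmonic_set(f0, partial_freqs, fs, theta=0.03):
    """Map harmonic number n to the index of the partial matched to n * f0."""
    partial_freqs = np.asarray(partial_freqs, dtype=float)
    n_max = int(np.floor((fs / 2.0) / f0))       # largest n with n * f0 <= fs / 2
    matches = {}
    for n in range(1, n_max + 1):
        target = n * f0
        rel_dev = np.abs(partial_freqs - target) / target
        j = int(np.argmin(rel_dev))              # closest partial to the ideal harmonic
        if rel_dev[j] <= theta:                  # assumed relative-deviation test
            matches[n] = j
    return matches

# Illustrative mean frequencies (Hz) of six detected partials.
fs = 22050
freqs = [220.0, 330.0, 440.0, 661.0, 880.0, 990.0]
set_a = harmonic_set(220.0, freqs, fs)
set_b = harmonic_set(330.0, freqs, fs)
# Partials claimed by both sources are the overlapping ones.
shared = sorted(set(set_a.values()) & set(set_b.values()))
```

In this toy case the partial at 661 Hz is claimed both as the third harmonic of 220 Hz and as the second harmonic of 330 Hz, so it is the only overlapping partial.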
Accuracy results of the fundamental frequency estimation algorithm

| Mixture type | Analyzed signals (#) | Successful detections (#) | Estimation error (%) |
|---|---|---|---|
| 1 instr. | 106 | 106 | 0 |
| 2 instr. | 75 | 74 | 1.34 |
| 1H + 1I instr. | 4 | 4 | 0 |
| 3 instr. | 50 | 49 | 2 |
| Total | 235 | 233 | 0.85 |
The separation process
Analyzing the sets of harmonic partials for each source, it is easy to distinguish between isolated harmonics (that is, partials which only belong to a single source) and overlapping harmonics (partials shared by two or more sources). The isolated harmonics and the fundamental partial of each source will be used later to separate the overlapping partials, through their onset and offset times, instantaneous envelopes, and phases. The separated source is eventually synthesized by the additive synthesis of its related set of partials (isolated and separated).
The inharmonic limit
where $\delta ={(m{f}_{n}/n{f}_{m})}^{2}$ and ε is an induced error due to the physical structure of the piano which cannot be evaluated [42]. If partials m and n are correctly selected, ε ≈ 1.
With the inharmonic model of Equation (19), it is possible to calculate the inharmonicity parameter β for each detected source, using (when possible) two isolated partials situated in the appropriate octaves. A priori, this technique includes inharmonic instruments (like the piano) in the proposed model. Unfortunately, obtaining the parameter β does not significantly improve the quality separation measurements evaluated in the tests.
Assumptions
In order to obtain the envelopes and phases of an overlapping partial related to each source, we will assume two approximations. The first one is a slightly less restrictive version of the CAM principle, which asserts that the amplitude envelopes of spectral components from the same source are correlated [22].

The amplitudes (envelopes) of two harmonics P_{1} and P_{2}, with similar energy E_{1} ≈ E_{2}, both belonging to the same source, have a high correlation coefficient.
The better this approximation holds, the better the separation results. As we are using the global signal information, the correlation coefficient between the strongest harmonic (and/or the fundamental partial) and the other harmonics decreases as the amplitude/energy differences between the involved partials increase [22]. Hence, reconstructing an overlapping harmonic from nonoverlapping harmonics whose energy is similar to its own should yield a higher correlation between the involved partials. In fact, as the correlation between high-energy partials also tends to be high, while the errors related to this assumption in lower energy partials tend to be energetically negligible, in most cases the quality measurement parameters have a high value, and the acoustic differences between the original and the separated source are acceptable.
The second approximation is

The instantaneous phases of the p th and the q th harmonic partials belonging to the same source are approximately proportional with ratio p/q, except for an initial phase gap, Δϕ_{0}. That is,
${\varphi}_{2}\left(t\right)\approx \frac{p}{q}{\varphi}_{1}\left(t\right)+\Delta {\varphi}_{0}$ (20)
where Δ ϕ_{0} = 0 means that the initial phases of the involved partials are equal, that is, ϕ_{0p}= ϕ_{0q}.
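Equation (20) can be illustrated with a short sketch that estimates the phase of an overlapped harmonic from the phase of an isolated harmonic of the same source. The function name `reconstruct_phase` and the ideal harmonic source used as input are hypothetical, and Δϕ_{0} is simply set to zero.

```python
import numpy as np

def reconstruct_phase(phi_ref, q, p, delta_phi0=0.0):
    """Phase of the p-th harmonic estimated from the phase of the q-th harmonic."""
    return (p / q) * np.asarray(phi_ref, dtype=float) + delta_phi0

# Illustrative source with f0 = 220 Hz and perfectly harmonic partials.
fs = 22050
t = np.arange(2205) / fs
f0 = 220.0
phi_2 = 2 * np.pi * 2 * f0 * t           # isolated 2nd harmonic (reference)
phi_3_true = 2 * np.pi * 3 * f0 * t      # overlapped 3rd harmonic (ground truth)
phi_3_est = reconstruct_phase(phi_2, q=2, p=3)
```

For a real instrument the match is only approximate, since the partials are not exactly harmonic and their initial phases differ.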
We have found that, in our model of the audio signal and even when the envelopes of the original overlapping harmonics are known, a difference in the initial phase of $\Delta {\varphi}_{0}=1{0}^{-3}$ is enough to prevent an adequate reconstruction of the mixed partial. Each partial has a random initial phase (i.e., there is no relation between ϕ_{0p} and ϕ_{0q}). However, as the instantaneous frequency of the mixed harmonics can be retrieved accurately regardless of the value of the initial phase, the original and the synthetically mixed partials (using the separated contribution from each source) sound similar (provided that the first assumption is true).
Reconstruction process and additive synthesis
Using the information of the isolated partials and an onset detection algorithm [38], it is easy to detect the beginning and the end of each note present. This information is necessary to avoid the artifacts and/or noise caused by the mixture process, which tend to appear before and after active notes. This noise is acoustically annoying and worsens the numerical quality separation measurement results.
where ${P}_{{s}_{k}}\left(t\right)$ are the original harmonics which overlap in the mixed partial. In Equation (22), the only accessible information is the instantaneous amplitude and phase of the mixed partial, that is, A_{ m }(t) and ϕ_{ m }(t). The aim is to recover each ${A}_{{s}_{k}}\left(t\right)$ and ${\varphi}_{{s}_{k}}\left(t\right)$ as accurately as possible.
for some p_{ k }, q_{ k }in$\mathbb{N}$.
Hence, it is possible to use Equation (20) to reconstruct the phases ${\varphi}_{{s}_{k}}$ of the separated partials for each overlapping source.^{b}
where A is a matrix which contains the envelopes of each selected (winner) partial described by Equations (23) and (24), α is the mixture vector and b = A_{ m }(t).
Once each separated partial is obtained using the technique described, it is added to its corresponding source. This iterative process eventually results in the separated sources.
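The least-squares step can be sketched as follows, under the assumption that the columns of A are the envelopes of the selected isolated partials (one per overlapping source) and that b is the envelope A_m(t) of the mixed partial. The solver call, the function name, and the synthetic envelopes are choices made for illustration, not the authors' implementation.

```python
import numpy as np

def mixture_weights(envelopes, mixed_envelope):
    """Least-squares mixture vector alpha such that A @ alpha approximates b."""
    A = np.column_stack(envelopes)                    # one column per selected partial
    b = np.asarray(mixed_envelope, dtype=float)       # envelope of the mixed partial
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# Illustrative envelopes of two isolated partials, one from each source.
t = np.linspace(0.0, 1.0, 500)
env_1 = np.exp(-3.0 * t)             # plucked-like decay
env_2 = t * np.exp(-5.0 * t)         # slower attack
A_m = 0.7 * env_1 + 0.2 * env_2      # synthetic mixed envelope
alpha = mixture_weights([env_1, env_2], A_m)
# Estimated contribution of each source to the overlapping partial.
separated = [a * e for a, e in zip(alpha, (env_1, env_2))]
```

Each scaled envelope in `separated`, together with a phase reconstructed from the same source, gives one separated partial that can be added to its source.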
In the figures, the separated wavelet spectrogram shows that only the harmonic partials have been recovered. When the inharmonic partials carry important (non-noisy) information, the synthetic signal can sound somewhat different (as happened with the possible envelope errors in the high-frequency partials).
The values of the standard quality measurement parameters for this example and the rest of the analyzed signals will be detailed in “Summarizing: graphical results” section.
Main characteristics, advantages, and limitations
Because overlapping partials are reconstructed from isolated ones, no information is wrongly assigned to the separated sources using this technique, except for the interference already present in the set of isolated partials. This means that the interference terms in the separation process will in general be negligible. This result will be numerically confirmed in the “Experimental results” section.
This separation process has two main advantages. First, the separation of overlapping harmonics (multipitch estimation, calculation of the best linear combination for reconstruction, additive synthesis) is not computationally expensive. In fact, computing the wavelet coefficients and separating them into partials takes much more computation time. Second, the separation is completely blind. That is, we do not need any a priori characteristic of the input signal: neither the pitch contour of the original sources, nor the relative energy, nor the number of sources present, etc.
One of the most important limitations of this method is that it is not valid for separating completely overlapping notes. Although the detailed algorithm of estimation of fundamental frequencies is capable of detecting overlapping fundamentals, in such a case the set of isolated partials of the overlapped source would be essentially empty, and therefore no isolated information would be available to carry out the reconstruction of phases and amplitudes of the corresponding source. To solve this problem (assuming the separation of musical themes of longer duration), it is possible to use models of the instruments present in the mixture, or previously separated notes from the same source. These ideas are the basis of the HTES and AHS techniques (see “Introduction” section).
On the other hand, as noted in the “Introduction” section, at its current stage, the proposed technique can be used to separate two or more musical instruments, each one playing a single note. The final quality of the separation depends on the number of mixed sources. This is due to the accuracy of the estimation of fundamental frequencies, and to the use of isolated partials to reconstruct the overlapping harmonics. The higher the number of sources, the lower the number of isolated harmonics and the poorer the final musical timbre of the separated sources.
Experimental results
The analyzed set of signals includes approximately 100 signals with two sources and 60 signals with three sources. All the analyzed signals are real recordings of musical instruments, most of them extracted from [41]. The final set of musical instruments includes flute, clarinet, sax, trombone, trumpet, oboe, bassoon, horn, tuba, violin, viola, guitar, and piano.
All the analyzed signals have been subsampled to f_{ s }= 22050 Hz, then synthetically mixed. The number of divisions per octave D and all the thresholds used in the CWAS and the separation algorithms are the same for all the analyzed signals. Specifically, D = {16;32;64;128;128;100;100;100;100}, θ_{ th }= 0.03, E_{ th }= 1%. Observe that the number of divisions per octave depends on the octave, so we have a variable resolution.
List of BASS experiments developed

| Experiment (#) | Sources (#) | Instruments involved | Experiment characteristics |
|---|---|---|---|
| 1 | 2 | Different | 1 Harm. + 1 Inharm. |
| 2 | 2 | Same | Same octave |
| 3 | 2 | Same | 5th & 12th intervals |
| 4 | 2 | Different | 5th & 12th intervals |
| 5 | 2 | Different | Inharmonic notes |
| 6 | 3 | Same | Major chord |
| 7 | 3 | Same | Minor chord |
| 8 | 3 | Different | Inharmonic notes |
Experiment 1: harmonic and inharmonic instruments
Experiment 2: single instrument, same octave
In the second test, two musical instruments (Alto Sax and Flute, respectively) were taken randomly from the original database. We have generated a total of 11 signals with each instrument, with two notes of the fourth octave (considering A 4 = 440 Hz) played by the same instrument. One of the notes is always a C# 4 (277 Hz), the other note corresponds to the same octave (C 4, D 4, D# 4, etc.). The experimental values of SDR, SIR, and SAR are presented in the second column of Figures 11, 12, and 13.
Experiment 3: single instrument, harmonically related notes
In the third experiment, we mixed two harmonically related notes from the same instrument. The harmonic relations used are: C−G, D−A, E−B, F−C, G−D, A−E, and A#−F from the same or different octave, that is, 5th and 12th intervals. We have generated three sets of signals, each one corresponding to one musical instrument (specifically, Alto Sax, Flute, and Bb Clarinet), and seven mixtures from each one. Numerical results of this experiment are shown in the third column of Figures 11, 12, and 13.
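To see why these intervals are demanding, the sketch below lists which harmonics of two notes nearly coincide. It assumes equal temperament with A 4 = 440 Hz and ideal harmonic partials; the MIDI note numbering, the 1% tolerance, and the limit of 12 harmonics are choices made only for this illustration.

```python
def note_frequency(midi_number, a4=440.0):
    """Equal-tempered frequency in Hz of a MIDI note number (A 4 = 69)."""
    return a4 * 2.0 ** ((midi_number - 69) / 12.0)

def overlapping_harmonics(f_low, f_high, n_harm=12, tol=0.01):
    """Pairs (m, n) such that harmonic m of f_low nearly coincides with harmonic n of f_high."""
    pairs = []
    for m in range(1, n_harm + 1):
        for n in range(1, n_harm + 1):
            if abs(m * f_low - n * f_high) / (n * f_high) <= tol:
                pairs.append((m, n))
    return pairs

c4, g4, g5 = note_frequency(60), note_frequency(67), note_frequency(79)
fifth = overlapping_harmonics(c4, g4)     # C 4 with G 4 (5th)
twelfth = overlapping_harmonics(c4, g5)   # C 4 with G 5 (12th)
```

For the 5th, every third harmonic of the lower note falls on every second harmonic of the upper one; for the 12th, every harmonic of the upper note, including its fundamental, falls on a harmonic of the lower note.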
Experiment 4: two instruments, harmonically related notes
In the next experiment, we have mixed in 20 signals the same harmonic intervals of the previous experiment, this time played by different musical instruments: Alto sax, guitar, bassoon, Bb and Eb clarinets, horn, oboe, and flute. The experimental values of the quality separation measurement are presented in the fourth column of Figures 11, 12, and 13.
Experiment 5: two instruments, inharmonic notes
In this experiment, each analyzed signal contains the mixture of two randomly chosen musical instruments playing random (nonharmonically related) notes. The experimental values of the quality separation parameters are presented in the fifth column of Figures 11, 12, and 13.
Experiment 6: one instrument, major chord
A major chord is the mixture of three notes, specifically C−E−G. We have generated 20 of these chords, each played by a single musical instrument, specifically Bassoon, Alto Sax, Bb Clarinet, Flute, and Trumpet. Numerical data are presented in the sixth column of Figures 11, 12, and 13.
Experiment 7: one instrument, minor chord
A minor chord is the mixture of A−C−E notes. We have analyzed 20 signals, each one played by a single musical instrument: Bassoon, Bb Clarinet, Horn, Oboe, and Trumpet. The SDR, SIR, and SAR values for this experiment are depicted in the seventh column of Figures 11, 12, and 13.
Experiment 8: three instruments, inharmonic notes
Finally, 20 signals with three randomly chosen instruments playing random (nonharmonically related) notes have been analyzed. These signals are randomly distributed from octaves 2 to 6, and 10 of the signals present widely separated notes. The experimental values of the quality separation measurement parameters are presented in the last column of Figures 11, 12, and 13.
Quality separation measurement
We will assume that the errors committed in the separation process can have three different origins: they can be due to interference between sources, to distortions inserted in the separated signal, and to artifacts introduced by the separation algorithm itself.
where D_{interf}, D_{total}, and D_{artif} are energy ratios involving the separated signals and the target (isolated, supposed known) signals. The quality separation measurements of the next sections have been obtained within the MATLAB^{®} toolbox BSS_EVAL, developed by Févotte, Gribonval, and Vincent and distributed online under the GNU Public License [44].
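For orientation, the three measures can be sketched with a simplified decomposition of the estimated source into a target part, an interference part, and an artifact part. This is not the BSS_EVAL toolbox: it assumes the time-invariant gain case with no noise term, where the target part is the projection of the estimate onto the true source, the interference part is the rest of its projection onto the span of all true sources, and the artifact part is the remainder. The function name and the test signals are hypothetical.

```python
import numpy as np

def separation_metrics(estimate, sources, target_index):
    """Simplified SDR, SIR, SAR (dB) of one estimated source; a sketch, not BSS_EVAL."""
    S = np.asarray(sources, dtype=float)          # shape (n_sources, n_samples)
    est = np.asarray(estimate, dtype=float)
    target = S[target_index]
    # Projection of the estimate onto the true target source.
    s_target = (est @ target) / (target @ target) * target
    # Projection of the estimate onto the span of all true sources.
    coeffs, *_ = np.linalg.lstsq(S.T, est, rcond=None)
    projection = S.T @ coeffs
    e_interf = projection - s_target              # part explained by the other sources
    e_artif = est - projection                    # part explained by no source

    def db(num, den):
        return 10.0 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))

    sdr = db(s_target, e_interf + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf, e_artif)
    return sdr, sir, sar

# Illustrative case: the estimate leaks 1% of the other source plus a 10% artifact.
fs = 22050
t = np.arange(2205) / fs
s0 = np.sin(2 * np.pi * 220.0 * t)
s1 = np.sin(2 * np.pi * 330.0 * t)
artifact = np.sin(2 * np.pi * 550.0 * t)
estimate = s0 + 0.01 * s1 + 0.1 * artifact
sdr, sir, sar = separation_metrics(estimate, [s0, s1], target_index=0)
```

In this toy case the interference is 40 dB below the target while the artifact is about 20 dB below it, so SDR and SAR are almost equal and SIR is much higher, the same pattern as in the averages reported in this article.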
Summarizing: graphical results
As advanced before, in Figures 11, 12, and 13, we show the numerical results of the detailed tests. In Figure 11, the experimental values of the SDR parameter for each experiment are presented. In Figure 12, we have depicted the obtained SIR values. Finally, in Figure 13, the experimental values of the SAR parameter are shown.
In Figure 11, the SDR mean result for each test is marked with squares, and the maximum and minimum values of the parameter with triangles. These results show significant differences in the quality separation measurements for the experiments involving two sources. In the case of experiments with three sources, the differences are smaller.
In Figure 12, the SIR mean result for each test is marked with circles, and the maximum and minimum values of the parameter with triangles. As can be seen in the figure, the experimental values of SIR vary less than in the previous case. This means that the proposed technique does not show a significant tendency toward high interference terms.
Finally, in Figure 13, the SAR results for each test are marked with stars. The maxima and minima of the experiments are depicted with triangles. The conclusions are the same as for Figure 11.
If we consider globally the whole set of signals with two mixed sources, the mean values of the quality separation measurement parameters can be used as an indication of the final quality of the separation. These values (represented in Figures 11, 12, and 13 with horizontal dash-dotted lines) are

$\overline{\mathrm{SDR}_{2s}} \approx 16.07$ dB.

$\overline{\mathrm{SIR}_{2s}} \approx 58.85$ dB.

$\overline{\mathrm{SAR}_{2s}} \approx 16.08$ dB.
The averages of the standard parameters in the case of three mixed sources (horizontal dashed lines in Figures 11, 12, and 13) are

$\overline{\mathrm{SDR}_{3s}} \approx 12.81$ dB.

$\overline{\mathrm{SIR}_{3s}} \approx 52.03$ dB.

$\overline{\mathrm{SAR}_{3s}} \approx 12.82$ dB.
These results are consistent with the increasing number of sources in the mixture. Under the same degree of precision in the frequency axis, the higher the number of sources, the lower the separation between partials and the higher the probability of interference (lower SIR). Hence, the final distortions and artifacts tend to increase.
Conclusions
In this study, a BASS technique for monaural musical notes has been presented. There are two main differences between the proposed algorithm and the existing ones. First, the time–frequency analysis tool is not based on the STFT but on the CCWT, which offers a highly coherent model of the audio signal in both time and frequency domains. This tool allows us to obtain with great accuracy the instantaneous evolution (in time and frequency) of the isolated harmonics, easily assignable to the sources present in the mixture. Second, the separation algorithm only needs the mixed signal as input; no additional information is needed. The overlapping partials can be reconstructed entirely from the isolated partials by searching for the best linear combination which minimizes the amplitude error in the mixture process, assuming the CAM principle. When nonoverlapping partials with energy similar to that of the overlapping partial are used, the correlation factor tends to be high if the overlapping partial has high energy, and the errors associated with a low correlation are usually acceptable if its energy is low. The phase reconstruction is not as important as in other techniques, and the separated sources obtained have both high quality separation measurement values and a high acoustic resemblance to the original signals.
At its current stage, the proposed technique can be used to separate two or more (monophonic) sources playing a single (and non-proportional) note each. As the polyphony of the mixture signal increases, the timbre of the separated signals tends to resemble that of the original signals less closely, because the set of isolated partials shrinks and, therefore, the information used in the reconstruction is smaller and less varied. Regarding the numerical quality results, the SDR and SAR parameters descend with respect to the results shown from polyphony 5, while the SIR parameter, although it has a clear downward trend, remains high.
To develop a complete source separation algorithm, several improvements are needed.
First, this technique must be implemented as a frame-to-frame algorithm in order to separate long-duration signals. The fundamental frequency, onset, and offset estimation algorithms presented in the “Separation algorithm” section and in [38] can work dynamically, obtaining the pitch, start time, and end time of each note present in the mixture.
There are several useful techniques for properly assigning each separated note to its corresponding source. One option is to use a rough estimation of the pitches of the mixture [22] or the score of the analyzed signal. Another possibility is to develop a timbre classification algorithm. This method has the advantage of maintaining the blindness of the system, but the drawback of a potential loss of generality. Both methods could also be used to overcome the limitation of the presented technique in separating polyphonic instruments.
Finally, as discussed briefly in “Main characteristics, advantages, and limitations” section, the appearance of completely overlapping notes is statistically inevitable in real recordings. This problem (one of the core problems in BASS) must be addressed to develop a complete separation algorithm. Therefore, future challenges remain to be tackled.
Endnotes
^{a}Each original archive consists of a certain number of notes. Each note is approximately 2 s long and is immediately preceded and followed by ambient silence. The instruments are recorded in an anechoic chamber. Some instruments are recorded with and without vibrato. All samples are in mono, 16 bit, 44.1 kHz, AIFF format. The excerpts, resampled to 16 bit, 22.05 kHz, WAV format, consist of isolated notes. Some of these notes have been mixed synthetically.
^{b}We assume Δϕ_{0} = 0 in Equation (20), but in fact a random initial phase can be inserted without any significant difference in either the numerical or the acoustical results.
Declarations
Acknowledgements
This study was supported by the Spanish government project TEC2009-14414-C03-01 (Analysis, Classification and Separation of Sound Sources, Ancla S^{3}v2.0). Many thanks to the reviewers for their insightful comments.
References
[1] Rickard S: Blind Speech Separation, chapter 8. The DUET Blind Source Separation Algorithm. (Springer, Netherlands, 2007), pp. 217–241
[2] Melia T: Underdetermined blind source separation in echoic environments using linear arrays and sparse representations. Ph.D. thesis, School of Electrical, Electronic and Mechanical Engineering, University College Dublin, National University of Ireland, 2007
[3] Cobos M, López JJ: Stereo audio source separation based on time-frequency masking and multilevel thresholding. Dig. Signal Process 2008, 18: 960–976.
[4] Cobos M: Application of sound source separation methods to advanced spatial audio systems. Ph.D. thesis, Universidad Politécnica de Valencia, 2009
[5] Yilmaz Ö, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process 2004, 52(7):1830–1847.
[6] Bregman AS: Auditory Scene Analysis: The perceptual organization of sound. (MIT Press, Boston, 1990)
[7] Brown GJ, Cooke M: Computational auditory scene analysis. Comput. Speech Lang. Elsevier 1994, 8(4):297–336.
[8] Wang D, Brown GJ: Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. (Wiley-IEEE Press, Hoboken, 2006)
[9] Cauwenberghs G: Monaural separation of independent acoustical components. Proceedings of the 1999 IEEE International Symposium on Circuits and Systems, ISCAS ’99, vol. 5, (Orlando, Florida, USA), 1999, pp. 62–65
[10] Amari S, Cardoso JF: Blind source separation-semiparametric statistical approach. IEEE Trans. Signal Process 1997, 45(11):2692–2700.
[11] Cardoso JF: Blind signal separation: statistical principles. Proc. IEEE 1998, 86: 2009–2025.
[12] Casey MA, Westner W: Separation of mixed audio sources by independent subspace analysis. Proceedings on International Computer Music Conference, vol. 2000, (Berlin, Germany, 2000), pp. 1–8
[13] Jafari MG, Abdallah SA, Plumbley MD, Davies ME: Sparse coding for convolutive blind audio source separation. Lecture Notes in Comput. Sci., Independent Component Anal. Blind Signal Sep 2006, 3889: 132–139.
[14] Abdallah SA: Towards music perception by redundancy reduction and unsupervised learning in probabilistic models. PhD thesis, King’s College London, 2002
[15] Virtanen T: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process 2007, 15(3):1066–1074.
[16] Schmidt MN, Mørup M: Nonnegative matrix factor 2-D deconvolution for blind single channel source separation. Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation, ICA’06, Lecture Notes in Computer Science, vol. 3889, (Charleston, SC, USA), 2006, pp. 700–707
[17] Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401: 788–791.
[18] Schmidt MN, Olsson RK: Single-channel speech separation using sparse non-negative matrix factorization. International Conference on Spoken Language Processing, ICSLP’06, (Pittsburgh, Pennsylvania, USA) 2006, pp. 2614–2617
[19] Virtanen T, Klapuri A: Separation of harmonic sound sources using sinusoidal modeling. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’00, vol. 2, (Istanbul, Turkey) 2000, pp. 765–768
[20] Virtanen T: Sound source separation in monaural music signals. PhD thesis, Tampere University of Technology, 2006
[21] Every MR, Szymanski JE: Separation of synchronous pitched notes by spectral filtering of harmonics. IEEE Trans. Audio Speech Lang. Process 2006, 14(5):1845–1856.
[22] Li Y, Woodruff J, Wang D: Monaural musical sound separation based on pitch and common amplitude modulation. IEEE Trans. Audio Speech Lang. Process 2009, 17(7):1361–1371.
[23] Woodruff J, Li Y, Wang D: Resolving overlapping harmonics for monaural musical sound separation using pitch and common amplitude modulation. Proceedings of the International Conference on Music Information Retrieval, (Philadelphia, Pennsylvania, USA), 2008, pp. 538–543
[24] Hu G, Wang D: Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw 2004, 15(5):1135–1150.
[25] Burred JJ, Sikora T: On the use of auditory representations for sparsity-based sound source separation. Fifth International Conference on Information, Communications and Signal Processing, ICICS05, (Bangkok, Thailand) 2005, pp. 1466–1470
[26] Parsons TW: Separation of speech from interfering speech by means of harmonic selection. J. Acoust. Soc. Am 1976, 60(4):911–918.
[27] Virtanen T, Klapuri A: Separation of harmonic sounds using multipitch analysis and iterative parameter estimation. IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY, USA) 2001, pp. 83–86
[28] Klapuri A: Multipitch estimation and sound separation by the spectral smoothness principle. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), vol. 5, (Salt Lake City, Utah, USA) 2001, pp. 3381–3384
[29] Klapuri A: Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Trans. Speech Audio Process 2003, 11(6):804–816.
[30] Han J, Pardo B: Reconstructing completely overlapped notes from musical mixtures. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’11), (Evanston, IL, USA) 2011, pp. 249–252
[31] Duan Z, Zhang Y, Zhang C, Shi Z: Unsupervised single-channel music source separation by average harmonic structure modeling. IEEE Trans. Audio Speech Lang. Process 2008, 16(4):766–778.
[32] Beltrán JR, Ponce de León J: Estimation of the instantaneous amplitude and the instantaneous frequency of audio signals using complex wavelets. Signal Process 2010, 90(12):3093–3109.
[33] Daubechies I: Ten lectures on wavelets, vol. 61 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, (Pasadena, California, USA), 1992
[34] Beltrán JR, Ponce de León J: Analysis and synthesis of sounds through complex bandpass filterbanks. Proc. of the 118th Convention of the Audio Engineering Society (AES’05), (Preprint 6361), (Barcelona, Spain), May 2005
[35] Beltrán JR, Ponce de León J: Extracción de Leyes de Variación Frecuenciales Mediante la Transformada Wavelet Continua Compleja. Proceedings of the XX Simposium Nacional de la Unión Científica Internacional de Radio (URSI’05), (Valencia, Spain), 2005
[36] Boashash B: Estimating and interpreting the instantaneous frequency of a signal. Part 1: fundamentals. Proc. IEEE 1992, 80(4):520–538.
[37] Beltrán JR, Ponce de León J: Blind source separation of monaural musical signals using complex wavelets. Proceedings of the 12th International Conference on Digital Audio Effects (DAFx-09), (Como, Italy) 2009, pp. 353–358
[38] Beltrán JR, Ponce de León J, Degara N, Pena A: Localización de Onsets en Señales Musicales a través de Filtros Pasobanda Complejos. Proceedings of the XXIII Simposium Nacional de la Unión Científica Internacional de Radio (URSI’08), (Madrid, Spain), 2008
[39] Gribonval R, Vincent E, Févotte C, Benaroya L: Proposals for performance measurement in source separation. Proceedings of the International Conference on Independent Component Analysis and Blind Source Separation (ICA), (Nara, Japan), 2003, pp. 763–768
[40] Pérez-Sancho C, Rizo D, Illescas JM: Genre classification using chords and stochastic language models. Connection Sci 2009, 21(2–3):145–159.
[41] Fritts L: Electronic Music Studios. University of Iowa, Musical Instrument Samples Database, [Online] http://theremin.music.uiowa.edu/MIS.html
[42] Ortiz-Berenguer LI, Casajús-Quirós FJ, Torres-Guijarro M, Beracoechea JA: Piano transcription using pattern recognition: aspects on parameter extraction. Proceedings of the 7th Conference on Digital Audio Effects (DAFx’04), (Naples, Italy), 2004, pp. 212–216
[43] Ortiz-Berenguer LI: Identificación Automática de Acordes Musicales. PhD thesis, Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, 2002
[44] Févotte C, Gribonval R, Vincent E: BSS_EVAL Toolbox User Guide - Revision 2.0. IRISA Technical Report 1706, Rennes, France, 2005
[45] Vincent E, Gribonval R, Févotte C: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process 2006, 14(4):1462–1469.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.