Joint estimation methods generate and evaluate competing f0 combinations in order to select the most plausible one directly. This scheme, recently introduced in [24, 25], has the advantage that the amplitudes of overlapping partials can be approximated taking into account the partials of the other candidates in a given combination. Partial amplitudes can therefore depend on the particular combination being evaluated, unlike in an iterative estimation scheme such as matching pursuit, where a wrong estimate may produce cumulative errors.
The core method performs a frame-by-frame analysis, selecting the most likely combination of fundamental frequencies at each instant. For this purpose, a set of f0 candidates is first identified from the spectral peaks. Then, a set of possible combinations of candidates is generated, and a joint algorithm is used to find the most likely combination.
In order to evaluate a combination, hypothetical partial sequences (HPS, a term proposed in [26] to refer to a vector containing hypothetical partial amplitudes) are inferred for its candidates. In order to build these patterns, harmonic interactions with the partials of the other candidates in the combination are considered. The overlapped partials are first identified, and their amplitudes are estimated by linear interpolation using the non-overlapped harmonic amplitudes.
Once the patterns are inferred, they are evaluated taking into account the sum of their hypothetical harmonic amplitudes and a novel smoothness measure.
Combinations are analysed considering their individual candidate scores, and the most likely combination is selected at the target frame.
The method assumes that the spectral envelopes of the analysed sounds tend to vary smoothly as a function of frequency. The spectral smoothness principle has successfully been used in different ways in the literature [7, 26–29]. A novel smoothness measure based on the convolution of the hypothetical harmonic pattern with a Gaussian window is proposed.
The processing stages, shown in Figure 1, are described below.
2.1 Preprocessing
The analysis is performed in the frequency domain, computing the magnitude spectrogram using a 93 ms Hanning-windowed frame with a 9.28 ms hop size. This frame size is typically chosen for multiple f0 estimation of music signals in order to achieve a suitable frequency resolution, and it proved adequate experimentally. The resulting frame overlap may seem high from a practical point of view, but it was required to compare the method with other studies in MIREX (see 4.3).
To get a more precise estimation of the lower frequencies, zero padding is used: the windowed frame is extended with zeros to z times the original window size before computing the FFT.
To increase efficiency, many unnecessary spectral bins are discarded from the subsequent analysis using a simple peak-picking algorithm that extracts the hypothetical partials. At each frame, only those spectral peaks with an amplitude higher than a threshold μ are selected; the rest of the spectral information is removed, yielding a sparse representation that contains only a subset of the spectral bins. This thresholding does not have a significant effect on the results, as the values of μ are quite low, but it increases the efficiency of the method considerably.
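The preprocessing of a single frame can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sparse_peaks` and the default values of `z` and `mu` are assumptions made for the example, and a peak is taken to be a simple local maximum of the magnitude spectrum.

```python
import numpy as np

def sparse_peaks(frame, sample_rate, z=4, mu=0.1):
    """Return (frequencies, magnitudes) of the spectral peaks above mu.

    frame: 1-D array with the samples of one analysis frame.
    z: zero-padding factor (assumed default, not taken from the text).
    mu: peak amplitude threshold (assumed default, not taken from the text).
    """
    n = len(frame)
    windowed = frame * np.hanning(n)              # Hanning-windowed frame
    n_fft = z * n                                 # pad with zeros to z times the window size
    mag = np.abs(np.fft.rfft(windowed, n=n_fft))  # magnitude spectrum
    # A peak is a local maximum whose magnitude exceeds the threshold mu.
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner >= mag[2:]) & (inner > mu)
    bins = np.flatnonzero(is_peak) + 1
    freqs = bins * sample_rate / n_fft
    return freqs, mag[bins]
```

Only the returned peaks are kept for the later stages, which is what makes the representation sparse.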
2.2 Candidate selection
The evaluation of all possible f0 combinations in a mixture is computationally intractable; therefore, a reduced subset of candidates must be chosen before generating their combinations. For this, candidates are first selected from the spectral peaks within the range [fmin, fmax] corresponding to the musical pitches of interest. Harmonic sounds with missing fundamentals are not considered, although they seldom appear in practical situations. A minimum spectral peak amplitude ε for the first partial (f0) can also be assumed in this stage.
The spectral magnitudes at the candidate partial positions are considered as a criterion for candidate selection as described next.
2.2.1 Partial search
Slight harmonic deviations from ideal partial frequencies are common in music sounds, therefore inharmonicity must be considered for partial search. For this, a constant margin $f_h \pm f_r$ around each harmonic frequency is set. If there are no spectral peaks within this margin, the harmonic is considered to be missing. Besides considering a constant margin, frequency dependent margins were also tested assuming that partial deviations in high frequencies are larger than those in low frequencies. However, results decreased, mainly because many false positive harmonics (most of them corresponding to noise) can be found in high frequencies.
Different strategies were also tested for partial search, and finally, like in [30], the harmonic spectral location and spectral interval principles [31] were chosen in order to take inharmonicity into account. The ideal frequency $f_h$ of the first harmonic is initialized to $f_h = 2f_0$. The next ones are searched at $f_{h+1} = (f_x + f_0) \pm f_r$, where $f_x = f_i$ if the previous harmonic h was found at the frequency $f_i$, or $f_x = f_h$ if the previous partial was missing.
In many studies, the closest peak to $f_h$ within a given region is identified as a partial. A novel variation, which experimentally yielded a slight (although not significant) increase in the performance of the proposed method, is the inclusion of a triangular window. This window, centered at $f_h$ with a bandwidth $2f_r$ and a unity amplitude, is used to weight the partial magnitudes within this range (see Figure 2). The spectral peak with the maximum weighted value is selected as a partial. The advantage of this scheme is that low amplitude peaks are penalized and, besides the harmonic spectral location, intensity is also considered to correlate the most important spectral peaks with partials.
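The partial search described above can be sketched as follows. The function name `search_harmonics`, its arguments, and the default margin `fr` are hypothetical; the sketch assumes that the spectral peaks are given as two parallel arrays and returns, for each harmonic above the fundamental, the index of the selected peak or `None` when the harmonic is missing.

```python
import numpy as np

def search_harmonics(f0, peak_freqs, peak_mags, n_harmonics=10, fr=10.0):
    """Search the harmonics above f0 among the spectral peaks.

    fr is the constant search margin in Hz (assumed default value).
    """
    peak_freqs = np.asarray(peak_freqs, dtype=float)
    peak_mags = np.asarray(peak_mags, dtype=float)
    found = []
    f_h = 2.0 * f0                      # ideal position of the first harmonic
    for _ in range(n_harmonics):
        dist = np.abs(peak_freqs - f_h)
        inside = dist < fr
        if inside.any():
            # Triangular window: unity at f_h, falling to zero at f_h +/- fr.
            weighted = np.where(inside, peak_mags * (1.0 - dist / fr), -np.inf)
            i = int(np.argmax(weighted))
            found.append(i)
            f_x = peak_freqs[i]         # continue from the frequency actually found
        else:
            found.append(None)          # harmonic considered missing
            f_x = f_h                   # continue from the ideal frequency
        f_h = f_x + f0
    return found
```

With the triangular weighting, a strong peak slightly further from $f_h$ can be preferred over a weak peak that is closer to it, which is the behaviour the text describes.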
2.2.2 Selection of F candidates
Once the hypothetical partials for all possible candidates are searched, candidates are ordered decreasingly by the sum of their amplitudes and, at most, only the first F candidates of this ordered list are chosen for the following processing stages.
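As a small illustration of this ranking step, the sketch below keeps the F candidates with the largest summed partial amplitudes. The function name `select_candidates` and the dictionary-based input are assumptions made for the example.

```python
def select_candidates(amplitude_sums, F=10):
    """Keep at most F candidates, ordered by decreasing sum of partial amplitudes.

    amplitude_sums: dict mapping each candidate f0 to the sum of its
    partial amplitudes (hypothetical input format).
    """
    ranked = sorted(amplitude_sums, key=amplitude_sums.get, reverse=True)
    return ranked[:F]
```

For example, `select_candidates({110.0: 3.2, 220.0: 5.1, 330.0: 0.7}, F=2)` keeps the candidates at 220.0 and 110.0, in that order.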
Harmonic summation is a simple criterion for candidate selection, and other alternatives can be found in the literature, including a harmonicity criterion [30], partial beating [30], or the product of harmonic amplitudes in the power spectrum [20]. Evaluating alternative criteria for candidate selection is left for future work.
2.3 Generation of candidate combinations
All the possible combinations of the F selected candidates are calculated and evaluated, and the combination with the highest score is yielded at the target frame. The combinations consist of different numbers of fundamental frequencies. In contrast to studies like [26], there is no need for a priori estimation of the number of concurrent sounds before detecting the fundamental frequencies; the polyphony is implicitly calculated in the f0 estimation stage by choosing the combination with the highest score, independently of the number of candidates.
At each frame t, a set of combinations is obtained. For efficiency, like in [20], only the combinations with a maximum polyphony P are generated from the F candidates. The number of combinations without repetition (N) can be calculated as:
$$N = \sum_{n=1}^{P} \binom{F}{n} = \sum_{n=1}^{P} \frac{F!}{n!\,(F-n)!} \qquad (1)$$
Therefore, N combinations are evaluated at each frame, so the adequate selection of F and P is critical for the computational efficiency of the algorithm. An experimental discussion on this issue is presented in Sec. 4.2.
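To give a sense of how quickly N grows with F and P, the following sketch counts and enumerates the combinations. It assumes that N counts every non-empty subset of at most P candidates; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def count_combinations(F, P):
    """Number of combinations without repetition of 1..P candidates out of F."""
    return sum(comb(F, n) for n in range(1, P + 1))

def generate_combinations(candidates, P):
    """Yield every combination of at most P of the given candidates."""
    for n in range(1, min(P, len(candidates)) + 1):
        yield from combinations(candidates, n)
```

For instance, `count_combinations(10, 6)` gives 847 combinations to evaluate at each frame.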
2.4 Evaluation of combinations
In order to evaluate a combination, a hypothetical pattern is first estimated for each of its candidates. Then, these patterns are evaluated in terms of their intensity and smoothness, assuming that music sounds have a perceivable intensity and that their spectral shapes are smooth, as is the case for most harmonic instruments. The combination whose patterns maximize these measures is yielded at the target frame t.
2.4.1 Inference of hypothetical patterns
The intention of this stage is to infer harmonic patterns for the candidates. This is performed taking into account the interactions with other candidates in the analysed combination, assuming that they have smooth spectral envelopes. A pattern (HPS) is a vector $\mathbf{p}_c$ estimated for each candidate, consisting of the hypothetical harmonic amplitudes of the first H harmonics:
$$\mathbf{p}_c = \{p_{c,1}, p_{c,2}, \ldots, p_{c,h}, \ldots, p_{c,H}\} \qquad (2)$$
where $p_{c,h}$ is the amplitude of the h-th harmonic of the candidate c. The partials are searched the same way as previously described for the candidate selection stage. If a particular harmonic is not found within the search margin, then the corresponding value $p_{c,h}$ is set to zero. As in music sounds the first harmonics are usually the most representative and contain most of the sound energy, only the first H partials are considered to build the patterns. Once the partials of a candidate are identified, the HPS values are estimated taking into account the hypothetical source interactions. For this task, the harmonics of all the candidates in the combination are identified and labeled with the candidate they belong to (see Figure 3). After the labeling process, some harmonics will only belong to one candidate (non-overlapped harmonics), whereas others will belong to more than one candidate (overlapped harmonics).
Assuming that interactions between non-coincident partials (beating) do not alter significantly the original spectral amplitudes, the non-overlapped amplitudes are directly assigned to the HPS. However, the contribution of each source to an overlapped partial amplitude must be estimated.
An accurate estimate of the amplitudes of colliding partials cannot be obtained reliably from the spectral magnitude information alone. In this study, the additivity of the linear spectrum is assumed, as in most approaches in the literature. Assuming additivity and spectral smoothness, the amplitudes of overlapped partials can be estimated similarly to [26, 32] by linear interpolation of the neighboring non-overlapped partials, as shown in Figure 3 (bottom).
If there are two or more consecutive overlapped partials, then the interpolation is done the same way using the available non-overlapped values. For instance, if harmonics 2 and 3 of a pattern are overlapped, then the amplitudes of harmonics 1 and 4 are used to estimate them by linear interpolation.
After the interpolation, the estimated contribution of each partial to the mixture is subtracted before processing the next candidates. This calculation (see Figure 3) is done as follows:
- If the interpolated (expected) value is greater than the corresponding overlapped harmonic amplitude, then $p_{c,h}$ is set to the original harmonic amplitude, and the spectral peak is completely removed from the residual, setting it to zero for the candidates that share that partial.
- If the interpolated value is smaller than the corresponding overlapped harmonic amplitude, then $p_{c,h}$ is set to the interpolated amplitude, and this value is linearly subtracted for the candidates that share the harmonic.
The residual harmonic amplitudes after this process are iteratively analysed for the rest of the candidates in the combination in ascending frequency order.
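The inference of patterns for one combination can be sketched as below. This is an illustrative reading of the procedure, not the authors' code. The name `infer_patterns` and the input format are hypothetical, and three points are assumptions that the text does not settle: missing harmonics are excluded from the interpolation, an overlapped harmonic lying outside the range of the non-overlapped ones takes the nearest non-overlapped amplitude (the behaviour of `np.interp`), and a candidate whose harmonics are all overlapped simply takes what is left in the residual.

```python
import numpy as np

def infer_patterns(partial_bins, peak_mags):
    """Infer one HPS per candidate of a combination.

    partial_bins: one list per candidate (in ascending f0 order) holding, for
        each harmonic, the index of its spectral peak or None if missing.
    peak_mags: magnitudes of the spectral peaks of the frame.
    """
    residual = np.array(peak_mags, dtype=float)
    # Label the peaks: count how many candidates claim each of them.
    owners = np.zeros(len(residual), dtype=int)
    for bins in partial_bins:
        for b in bins:
            if b is not None:
                owners[b] += 1
    patterns = []
    for bins in partial_bins:
        hps = np.zeros(len(bins))       # missing harmonics stay at zero
        free = [h for h, b in enumerate(bins) if b is not None and owners[b] == 1]
        shared = [h for h, b in enumerate(bins) if b is not None and owners[b] > 1]
        for h in free:                  # non-overlapped: assigned directly
            hps[h] = residual[bins[h]]
        for h in shared:                # overlapped: interpolate, then subtract
            b = bins[h]
            if free:
                expected = float(np.interp(h, free, hps[free]))
            else:
                expected = residual[b]  # assumption: nothing to interpolate from
            if expected >= residual[b]:
                hps[h] = residual[b]
                residual[b] = 0.0       # peak fully removed from the residual
            else:
                hps[h] = expected
                residual[b] -= expected # linear subtraction
        patterns.append(hps)
    return patterns
```

As a usage example, take two candidates one octave apart that share three peaks: `infer_patterns([[0, 1, 2, 3, 4, 5], [1, 3, 5]], [1.0, 1.5, 0.6, 0.3, 0.2, 0.3])`. The lower candidate receives interpolated amplitudes at the shared peaks, and the upper one receives whatever remains in the residual.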
2.4.2 Candidate evaluation
The intensity l(c) of a candidate c is a measure of the strength of a source obtained by summing its HPS amplitudes:
$$l(c) = \sum_{h=1}^{H} p_{c,h} \qquad (3)$$
Assuming that a pattern should have a minimum loudness, those combinations having any candidate with a very low absolute (l(c) < η) or relative intensity are discarded.
The underlying hypothesis assumes that a smooth spectral pattern is more probable than an irregular one. This is assessed through a novel smoothness measure s(c) which is based on Gaussian smoothing.
To compute it, the HPS of a candidate is first normalized by dividing the amplitudes by its maximum value, obtaining $\bar{\mathbf{p}}_c$. The aim is to compare $\bar{\mathbf{p}}_c$ with a smooth model $\tilde{\mathbf{p}}_c$ built from it, in such a way that the similarity between $\bar{\mathbf{p}}_c$ and $\tilde{\mathbf{p}}_c$ gives an estimation of the smoothness.
For this purpose, $\bar{\mathbf{p}}_c$ is smoothed using a truncated normalized Gaussian window $N_{0,1}$, which is convolved with the normalized HPS to obtain $\tilde{\mathbf{p}}_c$:
$$\tilde{\mathbf{p}}_c = N_{0,1} * \bar{\mathbf{p}}_c \qquad (4)$$
Only three components were chosen for the Gaussian window of unity variance, due to the small size of $\mathbf{p}_c$, which is limited by H. Typical values for H are within the range H ∈ [5, 20], as the first harmonics contain most of the energy of a harmonic source.
Then, as shown in Figure 4, a roughness measure r(c) is computed by summing up the absolute differences between $\tilde{\mathbf{p}}_c$ and the actual normalized HPS amplitudes:
$$r(c) = \sum_{h=1}^{H} \left| \tilde{p}_{c,h} - \bar{p}_{c,h} \right| \qquad (5)$$
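The intensity and the roughness of a pattern can be sketched as follows. The window coefficients used here are an assumption: they are obtained by sampling a unit-variance Gaussian at -1, 0 and 1 and normalizing the three values to sum to one, which may differ from the coefficients used by the authors. The zero padding at the pattern edges introduced by `np.convolve` with `mode="same"` is likewise an assumption, and the function names are hypothetical.

```python
import numpy as np

def gaussian_window3(sigma=1.0):
    """Three-component Gaussian window, normalized to unit sum (assumed values)."""
    x = np.array([-1.0, 0.0, 1.0])
    w = np.exp(-0.5 * (x / sigma) ** 2)
    return w / w.sum()

def intensity(hps):
    """Sum of the HPS amplitudes."""
    return float(np.sum(hps))

def roughness(hps, window=None):
    """Sum of absolute differences between the smoothed and the normalized HPS."""
    hps = np.asarray(hps, dtype=float)
    if window is None:
        window = gaussian_window3()
    peak = hps.max()
    if peak <= 0.0:
        return 0.0                      # empty pattern: nothing to compare
    normalized = hps / peak             # divide by the maximum amplitude
    smooth = np.convolve(normalized, window, mode="same")
    return float(np.sum(np.abs(smooth - normalized)))
```

Under these assumptions, a smoothly decaying pattern such as `[1.0, 0.8, 0.6, 0.4, 0.2]` obtains a much lower roughness than an alternating one such as `[1.0, 0.0, 1.0, 0.0, 1.0]`.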
The roughness r(c) is then normalized to make it independent of the intensity:
(6)
And finally, the smoothness s(c) ∈ [0, 1] of a HPS is calculated as:
(7)
where $H_c$ is the index of the last harmonic found for the candidate. This factor was introduced to prevent high-frequency candidates, which have fewer partials than those at low frequencies, from obtaining a higher smoothness. This way, the smoothness is considered to be more reliable when there are more partials to estimate it.
A candidate score is computed taking into account the HPS smoothness and intensity:
(8)
where κ is an experimentally tuned factor that balances the contribution of the smoothness.
2.4.3 Combination selection
Once all candidates are evaluated, a salience measure $S(\mathcal{C})$ for a combination $\mathcal{C}$ is computed from the scores $S(c)$ of its candidates as:
$$S(\mathcal{C}) = \sum_{c \in \mathcal{C}} \left[ S(c) \right]^2 \qquad (9)$$
When there are overlapped partials, their amplitudes are estimated by interpolation, therefore the HPS smoothness tends to increase. To partially compensate for this effect in $S(\mathcal{C})$, the candidate scores are squared in order to boost the highest values. This favors a sparse representation, as it is convenient to explain the mixture with the minimum number of sources. Experimentally, it was found that this square factor was important to improve the success rate of the method (more details can be found in [4, p. 148]). Once $S(\mathcal{C})$ is computed for all the combinations at frame t, the one with the highest score is selected:
$$\hat{\mathcal{C}}(t) = \arg\max_{\mathcal{C}} S(\mathcal{C}) \qquad (10)$$
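The selection step can be illustrated with the short sketch below. It assumes that the salience of a combination is the sum of its squared candidate scores, as the text describes; the function names and the dictionary-based input are hypothetical.

```python
def combination_salience(candidate_scores):
    """Sum of the squared candidate scores of one combination."""
    return sum(s * s for s in candidate_scores)

def select_combination(scored_combinations):
    """Return the combination with the highest salience.

    scored_combinations: dict mapping each combination (a tuple of f0 values)
    to the list of scores of its candidates (hypothetical input format).
    """
    return max(
        scored_combinations,
        key=lambda comb: combination_salience(scored_combinations[comb]),
    )
```

Squaring favors sparse explanations: a single candidate with score 4 reaches a salience of 16, whereas two candidates with score 2 each reach only 8.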