Instrument Identification in Polyphonic Music: Feature Weighting to Minimize Influence of Sound Overlaps

We provide a new solution to the problem of feature variations caused by the overlapping of sounds in instrument identification in polyphonic music. When multiple instruments simultaneously play, partials (harmonic components) of their sounds overlap and interfere, which makes the acoustic features different from those of monophonic sounds. To cope with this, we weight features based on how much they are affected by overlapping. First, we quantitatively evaluate the influence of overlapping on each feature as the ratio of the within-class variance to the between-class variance in the distribution of training data obtained from polyphonic sounds. Then, we generate feature axes using a weighted mixture that minimizes the influence via linear discriminant analysis. In addition, we improve instrument identification using musical context. Experimental results showed that the recognition rates using both feature weighting and musical context were 84.1% for duo, 77.6% for trio, and 72.3% for quartet; those without using either were 53.4, 49.6, and 46.5%, respectively.


INTRODUCTION
While the recent worldwide popularization of online music distribution services and portable digital music players has enabled us to access a tremendous number of musical excerpts, we do not yet have easy and efficient ways to find those that we want. To solve this problem, efficient music information retrieval (MIR) technologies are indispensable. In particular, automatic description of musical content in a universal framework is expected to become one of the most important technologies for sophisticated MIR. In fact, frameworks such as MusicXML [1], WEDELMUSIC Format [2], and MPEG-7 [3] have been proposed for describing music or multimedia content.
One reasonable approach for this music description is to transcribe audio signals to traditional music scores because the music score is the most common symbolic music representation. Many researchers, therefore, have tried automatic music transcription [4-9], and their techniques can be applied to music description in a score-based format such as MusicXML. However, only a few of them have dealt with identifying musical instruments. Which instruments are used is important information for two reasons. One is that it is necessary for generating a complete score. Notes for different instruments, in general, should be described on different staves in a score, and each stave should have a description of instruments. The other reason is that the instruments characterize musical pieces, especially in classical music. The names of some musical forms are based on instrument names, such as "piano sonata" and "string quartet." When a user, therefore, wants to search for certain types of musical pieces, such as piano sonatas or string quartets, a retrieval system can use information on musical instruments. This information can also be used for jumping to the point where a certain instrument begins to play.
This paper, for these reasons, addresses the problem of identifying musical instruments in audio signals of polyphonic music, which facilitates the above-mentioned score-based music annotation; in particular, we deal with classical Western tonal music. Instrument identification is a sort of pattern recognition that corresponds to speaker identification in the field of speech information processing. Instrument identification, however, is a more difficult problem than noiseless single-speaker identification because, in most musical pieces, multiple instruments simultaneously play. In fact, studies dealing with polyphonic music [7, 10-13] have used duo or trio music chosen from 3-5 instrument candidates, whereas those dealing with monophonic sounds [14-23] have used 10-30 instruments and achieved performances of about 70-80%. Kashino and Murase [10] reported a performance of 88% for trio music played on piano, violin, and flute, given the correct fundamental frequencies (F0s). Kinoshita et al. [11] reported recognition rates of around 70% (70-80% if the correct F0s were given). Eggink and Brown [13] reported a recognition rate of about 50% for duo music chosen from five instruments, given the correct F0s. Although a new method that can deal with more complex musical signals has been proposed [24], it cannot be applied to score-based annotation such as MusicXML because the key idea behind this method is to identify the instrumentation at each frame, not the instrument for each note. The main difficulty in identifying instruments in polyphonic music is the fact that the acoustical features of each instrument cannot be extracted without blurring because of the overlapping of partials (harmonic components). If a clean sound for each instrument could be obtained using sound separation technology, identification in polyphonic music would become equivalent to identifying the monophonic sound of each instrument. In practice, however, a mixture of sounds is difficult to separate without distortion.
In this paper, we approach the above-mentioned overlapping problem by weighting each feature based on how much the feature is affected by the overlapping. If we can give higher weights to features suffering less from this problem and lower weights to features suffering more, it will facilitate robust instrument identification in polyphonic music. To do this, we quantitatively evaluate the influence of the overlapping on each feature as the ratio of the within-class variance to the between-class variance in the distribution of training data obtained from polyphonic sounds, because suffering greatly from the overlapping means having large variation when polyphonic sounds are analyzed. This evaluation makes the feature weighting described above equivalent to dimensionality reduction using linear discriminant analysis (LDA) on training data obtained from polyphonic sounds. Because LDA generates feature axes using a weighted mixture where the weights minimize the ratio of the within-class variance to the between-class variance, using LDA on training data obtained from polyphonic sounds generates a subspace where the influence of the overlapping problem is minimized. We call this method DAMS (discriminant analysis with mixed sounds). In previous studies, techniques such as time-domain waveform template matching [10], feature adaptation with manual feature classification [11], and the missing feature theory [12] have been tried to cope with the overlapping problem, but no attempts have been made to give features appropriate weights based on their robustness to the overlapping.
In addition, we propose a method for improving instrument identification using musical context. This method is aimed at avoiding musically unnatural errors by considering the temporal continuity of melodies; for example, if the identified instrument names of a note sequence are all "flute" except for one "clarinet," this exception can be considered an error and corrected.
The rest of this paper is organized as follows. In Section 2, we discuss how to achieve robust instrument identification in polyphonic music and propose our feature weighting method, DAMS. In Section 3, we propose a method for using musical context. Section 4 explains the details of our instrument identification method, and Section 5 reports the results of our experiments, including those under various conditions that were not reported in [25]. Finally, Section 6 concludes the paper.

INSTRUMENT IDENTIFICATION ROBUST TO OVERLAPPING OF SOUNDS
In this section, we discuss how to design an instrument identification method that is robust to the overlapping of sounds. First, we give the general formulation of instrument identification. Then, we explain that extracting harmonic structures effectively suppresses the influence of other simultaneously played notes. Next, we point out that harmonic structure extraction alone is insufficient, and we propose a method of feature weighting to improve the robustness.

General formulation of instrument identification
In our instrument identification methodology, the instrument for each note is identified. Suppose that a given audio signal contains K notes, n_1, n_2, ..., n_k, ..., n_K. The identification process has two basic subprocesses: feature extraction and a posteriori probability calculation. In the former process, a feature vector consisting of some acoustic features is extracted from the given audio signal for each note. Let x_k be the feature vector extracted for note n_k. In the latter process, for each of the target instruments ω_1, ..., ω_m, the probability p(ω_i | x_k) that the feature vector x_k is extracted from a sound of the instrument ω_i is calculated. Based on Bayes' theorem, p(ω_i | x_k) can be expanded as follows:

p(ω_i | x_k) = p(x_k | ω_i) p(ω_i) / Σ_{j=1}^{m} p(x_k | ω_j) p(ω_j),

where p(x_k | ω_i) is a probability density function (PDF) and p(ω_i) is the a priori probability with respect to the instrument ω_i. The PDF p(x_k | ω_i) is trained using data prepared in advance. Finally, the name of the instrument maximizing p(ω_i | x_k) is determined for each note n_k. The symbols used in this paper are listed in Table 1.
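To make this formulation concrete, the following is a minimal Python sketch of the posterior calculation; the pdfs callables and the priors vector are placeholders for the trained PDFs and a priori probabilities, and the function name is ours, not the paper's.

```python
import numpy as np

def posterior(x_k, pdfs, priors):
    """Posterior p(omega_i | x_k) for each instrument via Bayes' theorem.

    pdfs   : list of callables, pdfs[i](x) = p(x | omega_i), trained in advance
    priors : array of a priori probabilities p(omega_i)
    """
    likelihoods = np.array([pdf(x_k) for pdf in pdfs])
    joint = likelihoods * priors   # p(x_k | omega_i) * p(omega_i)
    return joint / joint.sum()     # normalizing by p(x_k) gives the posterior
```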

Use of harmonic structure model
In speech recognition and speaker recognition studies, features of spectral envelopes such as Mel-frequency cepstrum coefficients are commonly used. Although they can reasonably represent the general shapes of observed spectra, when a signal of multiple instruments playing simultaneously is analyzed, it is difficult to isolate the component corresponding to each instrument from the observed spectral envelope.
Because most musical sounds except percussive ones have harmonic structures, previous studies on instrument identification [7, 9, 11] have commonly extracted the harmonic structure of each note and then extracted acoustic features from the structures. We also extract the harmonic structure of each note and then extract acoustic features from the structure. The harmonic structure model H(n_k) of the note n_k can be represented as

H(n_k) = {(F_i(t), A_i(t)) | i = 1, ..., h, 0 ≤ t ≤ T},

where F_i(t) and A_i(t) are the frequency and amplitude of the ith partial at time t. Frequency is represented by relative frequency where the temporal median of the fundamental frequency, F_1(t), is 1. Above, h is the number of harmonics, and T is the note duration. This modeling of musical instrument sounds based on harmonic structures restricts the influence of the overlapping of sounds of multiple instruments to the overlapping of partials. Although actual musical instrument sounds contain nonharmonic components, which can be factors characterizing sounds, we focus only on harmonic ones because nonharmonic ones are difficult to reliably extract from a mixture of sounds.
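As an illustration, the following sketch extracts such a harmonic structure from a magnitude spectrogram, assuming the F0 track of the note is given in advance; the ±3% peak-search width and all names are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def extract_harmonic_structure(spec, freqs, f0_track, h=10):
    """Extract (F_i(t), A_i(t)) for the first h partials of one note.

    spec     : magnitude spectrogram, shape (n_bins, n_frames)
    freqs    : ascending center frequency of each bin (Hz)
    f0_track : F0 per frame (Hz), estimated or given in advance
    """
    n_frames = spec.shape[1]
    F = np.zeros((h, n_frames))  # partial frequencies
    A = np.zeros((h, n_frames))  # partial amplitudes
    for t in range(n_frames):
        for i in range(1, h + 1):
            # search for a spectral peak near the ideal partial frequency i*F0
            target = i * f0_track[t]
            lo = np.searchsorted(freqs, target * 0.97)
            hi = np.searchsorted(freqs, target * 1.03)
            if lo >= hi:
                continue
            k = lo + int(np.argmax(spec[lo:hi, t]))
            F[i - 1, t] = freqs[k]
            A[i - 1, t] = spec[k, t]
    # normalize frequency so that the temporal median of F0 is 1
    F /= np.median(F[0])
    return F, A
```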

Feature weighting based on robustness to overlapping of sounds
As described in the previous section, extracting harmonic structures restricts the influence of the overlapping of sounds of multiple instruments to the overlapping of partials. If two notes have no partials with common frequencies, the influence of one on the other when the two notes are simultaneously played may be negligibly small. In practice, however, partials often overlap. When two notes with the pitches of C4 (about 262 Hz) and G4 (about 392 Hz) are simultaneously played, for example, the 3i-th partials of the C4 note and the 2i-th partials of the G4 note overlap for every natural number i. Because note combinations that generate harmonious sounds cause overlaps in many partials in general, coping with the overlapping of partials is a serious problem.
One effective approach for coping with this overlapping problem is feature weighting based on robustness to the overlapping. If we can give higher weights to features suffering less from this problem and lower weights to features suffering more, it will facilitate robust instrument identification in polyphonic music. Concepts similar to this feature weighting have in fact been proposed, such as the missing feature theory [12] and feature adaptation [11].
(i) Eggink and Brown [12] applied the missing feature theory to the problem of identifying instruments in polyphonic music. This is a technique for canceling unreliable features at the identification step using a vector called a mask, which represents whether each feature is reliable. Because masking a feature is equivalent to giving it a weight of zero, this technique can be considered an implementation of the feature weighting concept. Although the technique is known to be effective if the features to be masked are given, automatic mask estimation is very difficult in general and has not yet been established.
(ii) Kinoshita et al. [11] proposed a feature adaptation method. They manually classified their features for identification into three types (additive, preferential, and fragile) according to how the features vary when partials overlap. Their method recalculates or cancels the features extracted from overlapping components according to the three types. Similarly to Eggink and Brown's work, canceling features can be considered an implementation of the feature weighting concept. Because this method requires manually classifying features in advance, however, using a variety of features is difficult. They also introduced a feature weighting technique, but it was applied to monophonic sounds and hence did not cope with the overlapping problem.
(iii) Kashino and Murase [10] took a different approach based on a time-domain waveform template-matching technique with adaptive template filtering. Their aim was the robust matching of an observed waveform and a mixture of waveform templates by adaptively filtering the templates. This study, therefore, did not deal with feature weighting based on the influence of the overlapping problem.
The issue in the feature weighting described above is how to quantitatively evaluate the influence of the overlapping problem. Because training data were obtained only from monophonic sounds in previous studies, this influence could not be evaluated by analyzing the training data. Our DAMS method quantitatively models the influence of the overlapping problem on each feature as the ratio of the within-class variance to the between-class variance in the distribution of training data obtained from polyphonic sounds. As described in the introduction, this modeling makes weighting features to minimize the influence of the overlapping problem equivalent to applying LDA to training data obtained from polyphonic sounds.
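As a concrete illustration of this evaluation, the sketch below scores each feature by the ratio of within-class to between-class variance over mixed-sound training data (assumed here to be available as NumPy arrays); LDA then generalizes this per-feature ratio to weighted mixtures of features.

```python
import numpy as np

def overlap_influence(features, labels):
    """Within-class / between-class variance ratio per feature.

    A large ratio means the feature varies a lot under overlapping
    partials, i.e., it suffers more and should get a lower weight.
    features : (n_samples, n_features) vectors from polyphonic training data
    labels   : (n_samples,) instrument label of each sample
    """
    grand_mean = features.mean(axis=0)
    within = np.zeros(features.shape[1])
    between = np.zeros(features.shape[1])
    for c in np.unique(labels):
        cls = features[labels == c]
        within += ((cls - cls.mean(axis=0)) ** 2).sum(axis=0)
        between += len(cls) * (cls.mean(axis=0) - grand_mean) ** 2
    return within / between  # assumes every feature varies between classes
```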
Training data are obtained from polyphonic sounds through the process shown in Figure 1. The sound of each note in the training data is labeled in advance with the instrument name, the F0, the onset time, and the duration. Using these labels, we extract the harmonic structure corresponding to each note from the spectrogram. We then extract acoustic features from the harmonic structure. We thus obtain a set of many feature vectors, called a mixed-sound template, from polyphonic sound mixtures.
The main issue in constructing a mixed-sound template is to design an appropriate subset of polyphonic sound mixtures. This is a serious issue because there are an infinite number of possible combinations of musical sounds due to the large pitch range of each instrument. The musical feature that is the key to resolving this issue is the tendency of intervals of simultaneous notes. In Western tonal music, some intervals such as minor 2nds are more rarely used than other intervals such as major 3rds and perfect 5ths because minor 2nds generate dissonant sounds in general. By generating polyphonic sounds for template construction from the scores of actual (existing) musical pieces, we can obtain a data set that reflects the tendency mentioned above. We believe that this approach improves instrument identification even if the pieces used for template construction are different from the piece to be identified, for the following two reasons.
(i) The distributions of intervals found in simultaneously sounding notes in tonal music follow a common tendency; for example, consonant intervals such as major 3rds and perfect 5ths appear far more often than dissonant ones such as minor 2nds, as noted above.
(ii) Because we extract the harmonic structure from each note, as previously mentioned, the influence of multiple instruments simultaneously playing is restricted to the overlapping of partials. The overlapping of partials can be explained by two main factors: which partials are affected by other sounds, related to note combinations, and how much each partial is affected, mainly related to instrument combinations. Note combinations can be reduced because our method considers only relative-pitch relationships, and the lack of instrument combinations is not critical to recognition, as we find in an experiment described below. If the intervals of note combinations in a training data set reflect those in actual music, therefore, the training data set will be effective despite a lack of other combinations.

USE OF MUSICAL CONTEXT
In this section, we propose a method for improving instrument identification by considering musical context. The aim of this method is to avoid unusual events in tonal music, for example, only one clarinet note appearing in a sequence of notes (a melody) played on a flute, as shown in Figure 2. As mentioned in Section 2.1, the a posteriori probability p(ω_i | x_k) is calculated for each note n_k. The key idea behind using musical context is to apply the a posteriori probabilities of n_k's temporally neighboring notes to the a priori probability p(ω_i) of the note n_k (Figure 3). This is based on the idea that if almost all notes around the note n_k are identified as the instrument ω_i, then n_k is also probably played on ω_i. To achieve this, we have to resolve the following issue.

Issue: distinguishing notes played on the same instrument as n k from neighboring notes
Because various instruments are played at the same time, an identification system has to distinguish the notes that are played on the same instrument as the note n_k from the notes played on other instruments. This is not easy because it is mutually dependent on musical instrument identification.
We resolve this issue as follows.
Solution: take advantage of the parallel movement of simultaneous parts.
In Western tonal music, voices rarely cross. This may be explained by the human ability to recognize multiple voices more easily when they do not cross each other in pitch [26].
When people listen, for example, to two simultaneous note sequences that cross, one descending and the other ascending, they perceive the sequences as approaching each other but never crossing. Huron also explains that the pitch-crossing rule (parts should not cross with respect to pitch) is a traditional voice-leading rule that can be derived from perceptual principles [27]. We therefore judge whether two notes, n_k and n_j, are in the same part (i.e., played on the same instrument) as follows: let s_h(n_k) and s_l(n_k) be the maximum numbers of simultaneously played notes in the higher and lower pitch ranges while the note n_k is being played. Then, the two notes n_k and n_j are considered to be in the same part if and only if s_h(n_k) = s_h(n_j) and s_l(n_k) = s_l(n_j) (Figure 4). Kashino and Murase [10] introduced musical role consistency to generate music streams.
They designed two kinds of musical roles: the highest and lowest notes (usually corresponding to the principal melody and bass lines). Our method can be considered an extension of their musical role consistency.
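A minimal sketch of this same-part test follows; the note objects with onset, offset, and pitch attributes, the 20-point time grid, and the function names are hypothetical.

```python
import numpy as np

def s_values(note, all_notes, grid=20):
    """(s_h, s_l): maximum numbers of simultaneously played notes in the
    higher / lower pitch ranges while `note` is sounding.

    Notes are assumed (hypothetically) to carry .onset, .offset, .pitch.
    """
    s_h = s_l = 0
    for t in np.linspace(note.onset, note.offset, num=grid):
        sounding = [n for n in all_notes
                    if n is not note and n.onset <= t < n.offset]
        s_h = max(s_h, sum(n.pitch > note.pitch for n in sounding))
        s_l = max(s_l, sum(n.pitch < note.pitch for n in sounding))
    return s_h, s_l

def same_part(n_k, n_j, all_notes):
    """n_k and n_j are judged to be in the same part iff both s-values match."""
    return s_values(n_k, all_notes) == s_values(n_j, all_notes)
```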

1st pass: precalculation of a posteriori probabilities
For each note n k , the a posteriori probability p(ω i | x k ) is calculated by considering the a priori probability p(ω i ) to be a constant because the a priori probability, which depends on the a posteriori probabilities of temporally neighboring notes, cannot be determined in this step.

2nd pass: recalculation of a posteriori probabilities
This pass consists of three steps.

(1) Finding notes played on the same instrument
Notes that satisfy the same-part condition described in Section 3 (equal s_h and s_l values) are extracted from the notes temporally neighboring n_k. This extraction proceeds from the nearest note to farther notes and stops when c notes have been extracted (c is a positive integer constant). Let N be the set of the extracted notes.
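A sketch of this extraction step, reusing the hypothetical same_part helper from the earlier sketch (the default c = 3 is arbitrary):

```python
def context_notes(n_k, all_notes, c=3):
    """The set N: up to c temporally neighboring notes judged to be in the
    same part as n_k, collected from the nearest note outward."""
    neighbors = sorted((n for n in all_notes if n is not n_k),
                       key=lambda n: abs(n.onset - n_k.onset))
    return [n for n in neighbors if same_part(n_k, n, all_notes)][:c]
```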

(2) Calculating the a priori probability
The a priori probability of the note n_k is calculated based on the a posteriori probabilities of the notes extracted in the previous step. Let p_1(ω_i) and p_2(ω_i) be the a priori probabilities calculated from musical context and from other cues, respectively. Then, we define the a priori probability p(ω_i) to be calculated here as

p(ω_i) = λ p_1(ω_i) + (1 − λ) p_2(ω_i),

where λ is a confidence measure of musical context. Although this measure could be calculated through statistical analysis as the probability that the note n_k is played on instrument ω_i when all the extracted neighboring notes of n_k are played on ω_i, we use λ = 1 − (1/2)^c for simplicity, where c is the number of notes in N. This is based on the heuristic that the more notes are used to represent a context, the more reliable the context information is. We define p_1(ω_i) as

p_1(ω_i) = (1/α) Σ_{n_j ∈ N} p(ω_i | x_j),

where x_j is the feature vector for the note n_j and α is the normalizing factor given by α = Σ_{ω_i} Σ_{n_j ∈ N} p(ω_i | x_j). We use p_2(ω_i) = 1/m for simplicity.
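The following sketch mirrors these definitions, assuming the first-pass a posteriori probabilities are stored in a dictionary keyed by note:

```python
import numpy as np

def context_prior(context, posteriors, m):
    """A priori probability p(omega_i) of a note from musical context.

    context    : the set N of same-part neighboring notes
    posteriors : dict, posteriors[n] = first-pass posterior vector of note n
    m          : number of target instruments
    """
    c = len(context)
    lam = 1.0 - 0.5 ** c                   # lambda = 1 - (1/2)^c
    p2 = np.full(m, 1.0 / m)               # p_2: uniform, for simplicity
    if c == 0:
        return p2                          # no context -> lambda = 0
    p1 = np.sum([posteriors[n] for n in context], axis=0)
    p1 /= p1.sum()                         # normalizing factor alpha
    return lam * p1 + (1.0 - lam) * p2
```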

(3) Updating the a posteriori probability
The a posteriori probability is recalculated using the a priori probability calculated in the previous step.

DETAILS OF OUR INSTRUMENT IDENTIFICATION METHOD
The details of our instrument identification method are given below. An overview is shown in Figure 5. First, the spectrogram of a given audio signal is generated. Next, the harmonic structure of each note is extracted based on data on the F0, the onset time, and the duration of each note, which are estimated in advance using an existing method (e.g., [7, 9, 28]). Then, feature extraction, dimensionality reduction, a posteriori probability calculation, and instrument determination are performed in that order.

Short-time Fourier transform
The spectrogram of the given audio signal is calculated using the short-time Fourier transform (STFT) shifted by 10 milliseconds (441 points at 44.1 kHz sampling) with an 8192-point Hamming window.
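For reference, these settings map directly onto an off-the-shelf STFT, as in the following sketch (the wrapper function is ours):

```python
import numpy as np
from scipy.signal import stft

def spectrogram(audio, sr=44100):
    """Magnitude spectrogram with the paper's STFT settings:
    8192-point Hamming window, 10 ms (441-sample) hop at 44.1 kHz."""
    f, t, Z = stft(audio, fs=sr, window="hamming",
                   nperseg=8192, noverlap=8192 - 441)
    return f, t, np.abs(Z)
```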

Harmonic structure extraction
The harmonic structure of each note is extracted according to note data estimated in advance. Spectral peaks corresponding to the first 10 harmonics are extracted from the onset time to the offset time. The offset time is calculated by adding the duration to the onset time. Then, the frequency of the spectral peaks is normalized so that the temporal median of the F0 is 1.
Next, the harmonic structure is trimmed because training and identification require notes with fixed durations. Because a mixed-sound template with a long duration is more stable and robust than a template with a short one, trimming a note to keep it as long as possible is best. We therefore prepare three templates with different durations (300, 450, and 600 milliseconds), and the longest usable one, as determined by the actual duration of each note, is automatically selected and used for training and identification. For example, the 450-millisecond template is selected for a 500-millisecond note. In this paper, the 300-millisecond, 450-millisecond, and 600-millisecond templates are called Template Types I, II, and III. Notes shorter than 300 milliseconds are not identified.
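A minimal sketch of this selection rule (function name ours):

```python
def select_template(duration_ms):
    """Pick the longest usable template type for a note of the given length.

    Returns None for notes shorter than 300 ms, which are not identified."""
    for template_type, length in (("III", 600), ("II", 450), ("I", 300)):
        if duration_ms >= length:
            return template_type, length
    return None
```

For example, select_template(500) returns Template Type II (450 milliseconds).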

Feature extraction
Features that are useful for identification are extracted from the harmonic structure of each note. From a feature set that we previously proposed [19], we selected 43 features (for Template Type III), summarized in Table 2, that we expected to be robust with respect to sound mixtures. We use 37 features for Template Type II and 31 for Type I because of the limitations of the note durations.

Dimensionality reduction
Using the DAMS method, the subspace minimizing the influence of the overlapping problem is obtained. Because the feature space should be decorrelated for the LDA calculation to be robust, before using the DAMS method, we obtain a noncorrelative space by using principal component analysis (PCA). The number of dimensions of the feature space obtained with PCA is determined so that the cumulative proportion value is 99% (20 dimensions in most cases). By using the DAMS method in this subspace, we obtain an (m − 1)-dimensional space (m: the number of instruments in the training data).
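A sketch of this two-stage reduction using scikit-learn, assuming the mixed-sound template has been turned into a feature matrix with instrument labels:

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dams_projection(X_train, y_train):
    """PCA to a 99% cumulative-proportion subspace, then LDA (DAMS).

    X_train : feature vectors from the mixed-sound template
    y_train : instrument label of each vector
    """
    pca = PCA(n_components=0.99, svd_solver="full")  # keep 99% of variance
    Z = pca.fit_transform(X_train)
    lda = LinearDiscriminantAnalysis()               # (m - 1)-dim subspace
    W = lda.fit_transform(Z, y_train)
    return pca, lda, W
```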

A posteriori probability calculation
For each note n_k, the a posteriori probability p(ω_i | x_k) is calculated. As described in Section 2.1, this probability is calculated with Bayes' theorem from the PDF p(x_k | ω_i) and the a priori probability p(ω_i). The PDF p(x_k | ω_i) is calculated from training data prepared in advance by using an F0-dependent multivariate normal distribution, as defined in our previous paper [19]. The F0-dependent multivariate normal distribution is designed to cope with the pitch dependency of features. It is specified by the following two parameters.

(i) F0-dependent mean function μ_i(f)

For each element of the feature vector, the pitch dependency of the distribution is approximated as a function (cubic polynomial) of F0 using the least-squares method.

(ii) F0-normalized covariance Σ_i

The F0-normalized covariance is calculated as

Σ_i = (1/|χ_i|) Σ_{x ∈ χ_i} (x − μ_i(f_x)) (x − μ_i(f_x))^⊤,

where χ_i is the set of the training data of instrument ω_i, |χ_i| is the size of χ_i, f_x denotes the F0 of feature vector x, and ⊤ represents the transposition operator. Once these parameters are estimated, the PDF is given as

p(x | ω_i) = (2π)^{−d/2} |Σ_i|^{−1/2} exp(−D²/2),

where d is the number of dimensions of the feature space and D² is the squared Mahalanobis distance defined by

D² = (x − μ_i(f_x))^⊤ Σ_i^{−1} (x − μ_i(f_x)).

The a priori probability p(ω_i) is calculated on the basis of the musical context, that is, the a posteriori probabilities of neighboring notes, as described in Section 3.

Instrument determination
Finally, the instrument maximizing the a posteriori probability p(ω i | x k ) is determined as the identification result for the note n k .

Data for experiments
We used audio signals generated by mixing audio data taken from a solo musical instrument sound database according to standard MIDI files (SMFs). This gave us correct data on the F0s, onset times, and durations of all notes, because the focus of our experiments was solely on evaluating the performance of our instrument identification method by itself.
The SMFs used in the experiments were three pieces taken from RWC-MDB-C-2001 (Piece Nos. 13, 16, and 17) [29]. These are classical musical pieces consisting of four or five simultaneous voices. We created SMFs of duo, trio, and quartet music by choosing two, three, and four simultaneous voices from each piece. We also prepared solo-melody SMFs for template construction.
As audio sources for generating the audio signals of duo, trio, and quartet music, excerpts of RWC-MDB-I-2001 [30], listed in Table 3, were used. To avoid using the same audio data for training and testing, we used 011PFNOM, 151VNNOM, 311CLNOM, and 331FLNOM for the test data and the others in Table 3 for the training data. We prepared audio signals of all possible instrument combinations within the restrictions in Table 4, which were defined by taking the pitch ranges of the instruments into account. For example, 48 different combinations were made for quartet music.

Experiment 1: leave-one-out
The experiment was conducted using the leave-one-out cross-validation method. When evaluating a musical piece, a mixed-sound template was constructed using the remaining two pieces. Because we evaluated three pieces, we constructed three different mixed-sound templates by dropping the piece used for testing. The mixed-sound templates were constructed from audio signals of solo and duo music (S + D) and of solo, duo, and trio music (S + D + T). For comparison, we also constructed a template, called a solo-sound template, only from solo musical sounds. The number of notes in each template is listed in Table 5. To evaluate the effectiveness of the F0-dependent multivariate normal distributions and of using musical context, we tested both cases with and without each technique. We fed the correct data on the F0s, onset times, and durations of all notes because our focus was on the performance of the instrument identification method alone.
The results are shown in Table 6. Each number in the table is the average of the recognition rates for the three pieces. Using the DAMS method, the F0-dependent multivariate normal distribution, and the musical context, we improved the recognition rates from 50.9 to 84.1% for duo, from 46.1 to 77.6% for trio, and from 43.1 to 72.3% for quartet music on average.
We confirmed the effect of each of the DAMS method (mixed-sound template), the F0-dependent multivariate normal distribution, and the musical context using McNemar's test. The results for the quartet music are listed in Table 7 (those for the trio and duo music are omitted but are basically the same as those for the quartet), where the χ²₀ values are the test statistics. Because the critical region at α = 0.001 (the level of significance) is (10.83, +∞), all differences except S + D versus S + D + T are significant at α = 0.001.
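For reference, the McNemar statistic can be computed as in the sketch below, shown with the common continuity correction (whether the original computation applied the correction is not stated here):

```python
def mcnemar_chi2(b, c):
    """McNemar test statistic (with continuity correction) for paired
    correct/incorrect labels under two conditions.

    b : number of notes correct only under condition A
    c : number of notes correct only under condition B
    Exceeding 10.83 (chi-square, df = 1) is significant at alpha = 0.001.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)
```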
Other observations are summarized as follows. (i) The results of the S + D and S + D + T templates were not significantly different even when the test data were from quartet music. This means that constructing a template from polyphonic sounds is effective even if the sounds used for the template construction do not have the same complexity as the piece to be identified.
(ii) For PF and CG, the F0-dependent multivariate normal distribution was particularly effective. This is because these instruments have large pitch dependencies due to their wide pitch ranges.
(iii) Using musical context improved recognition rates, on average, by approximately 10%. This is because, in the musical pieces used in our experiments, pitches in the melodies of simultaneous voices rarely crossed.
(iv) When the solo-sound template was used, the use of musical context lowered recognition rates, especially for CL. Because our method of using musical context calculates the a priori probability of each note on the basis of the a posteriori probabilities of temporally neighboring notes, it requires sufficiently accurate precalculation of those a posteriori probabilities. The lowered recognition rates are due to the insufficient accuracy of this precalculation. In fact, this phenomenon did not occur when the mixed-sound templates, which improved the accuracy of the precalculations, were used. Musical context should therefore be used together with some technique for improving the precalculation accuracy, such as a mixed-sound template.
(v) The recognition rate for PF was not high enough in some cases. This is because the timbre of PF is similar to that of CG. In fact, even humans had difficulty distinguishing them in listening tests of sounds resynthesized from the harmonic structures extracted from PF and CG tones.

Experiment 2: template construction from only one piece
Next, to compare template construction from only one piece with construction from two pieces (i.e., leave-one-out), we conducted an experiment on template construction from only one piece. The results are shown in Table 8. Even when using a template made from only one piece, we obtained comparatively high recognition rates for CG, VN, and CL. For FL, the results of constructing a template from only one piece were not high (e.g., 30-40%), but those from two pieces were close to the results of the case where the same piece was used for both template construction and testing. This means that a sufficient variety of sound-overlap influences can be learned from only two pieces.

Experiment 3: insufficient instrument combinations
We investigated the relationship between the coverage of instrument combinations in a template and the recognition rate. When a template that does not cover all instrument combinations is used, the recognition rate might decrease. If this decrease is large, the number of target instruments of the template will be difficult to increase because O(m^n) data are needed for a full-combination template, where m and n are the numbers of target instruments and simultaneous voices. The purpose of this experiment was to check whether such a decrease occurs when a reduced-combination template is used. As the reduced-combination template, we used one that contains only the combinations listed in Table 9. These combinations were chosen so that the number of combinations was O(m). Similarly to Experiment 1, we used the leave-one-out cross-validation method. As we can see from Table 10, we did not find significant differences between using the full instrument combinations and the reduced combinations. This was confirmed through McNemar's test, similarly to Experiment 1, as shown in Table 11. Therefore, we expect that the number of target instruments can be increased without the problem of combinatorial explosion.

Experiment 4: effectiveness of LDA
Finally, we compared dimensionality reduction using both PCA and LDA with that using only PCA to evaluate the effectiveness of LDA. The experimental method was leave-one-out cross-validation. The results are shown in Figure 6.
The difference between the recognition rates of the solo-sound template and the S + D or S + D + T template was 20-24% using PCA + LDA and 6-14% using PCA only. These results mean that LDA (i.e., DAMS) successfully obtained a subspace where the influence of the overlapping of sounds of multiple instruments was minimized by minimizing the ratio of the within-class variance to the between-class variance. Under all conditions, using LDA was superior to not using it.
We confirmed that combining LDA and the mixed-sound template is effective using a two-way factorial analysis of variance (ANOVA), where the two factors are the dimensionality reduction method (PCA only and PCA + LDA) and the template (S, S + D, and S + D + T). Because we tested each condition using the duo, trio, and quartet versions of Piece Nos. 13, 16, and 17, there are nine results for each cell of the two-factor matrix. The ANOVA table is given in Table 12. From the table, we can see that the interaction effect, as well as the main effects of the dimensionality reduction method and the template, is significant at α = 0.001. This result means that mixed-sound templates are particularly effective when combined with LDA.

Application to XML annotation
In this section, we show an example of XML annotation of musical audio signals using our instrument identification method. We used a simplified version of MusicXML instead of the original MusicXML format because our method does not include rhythm recognition and hence cannot determine note values or measures. The document-type definition (DTD) of our simplified MusicXML is shown in Figure 7.
The main differences between it and the original format are that elements related to notation, which cannot be estimated from audio signals, are reduced and that time is represented in seconds. The result of XML annotation of a piece of polyphonic music is shown in Figure 8. By using our instrument identification method, we classified notes according to part and described the instrument for each part.

Discussion
We achieved average recognition rates of 84.1% for duo, 77.6% for trio, and 72.3% for quartet music chosen from five different instruments. We think that this performance is state of the art, but we cannot directly compare these rates with experimental results published by other researchers because, in general, different researchers used different test data. We also find the following two limitations in our evaluation: (1) the correct F0s are given; (2) nonrealistic music (i.e., music synthesized by mixing isolated monophonic sound samples) is used.
First, in most existing studies, including ours, the methods were tested under the condition that the correct F0s are manually fed [10, 13]. This is because multiple-F0 estimation for a sound mixture is still a challenging problem, and the studies aimed at evaluating the performance of only their instrument identification methods. If estimated F0s are used instead of the manually given correct F0s, the performance of instrument identification will decrease. In fact, Kinoshita et al. [11] reported that, given random note patterns taken from three different instruments, the instrument identification performance was around 72-81% for correct F0s but decreased to around 66-75% for estimated F0s. Because multiple-F0 estimation has been actively studied [8, 31, 32], we plan to integrate and evaluate our instrument identification method with such a multiple-F0 estimation method in the future. Second, most existing studies, including ours, used nonrealistic music as test samples. For example, Kashino et al. [7] and Kinoshita et al. [11] tested their methods on polyphonic musical audio signals that were synthesized by mixing isolated monophonic sounds of every target instrument on a MIDI sampler. This was because the information on the instrument for every note, used as the correct reference in the evaluation, was then easy to prepare. Strictly speaking, however, the acoustical characteristics of real music are different from those of such synthesized music. The performance of our method would decrease for real music because legato playing sometimes causes successive notes to overlap with unclear onsets in a melody and because sound mixtures often involve reverberation. We plan to manually annotate the correct F0 information for real music and evaluate our method after integrating it with a multiple-F0 estimation method, as mentioned above.

CONCLUSION
We have provided a new solution to an important problem of instrument identification in polyphonic music: the overlapping of partials (harmonic components). Our solution is to weight features based on their robustness to overlapping by collecting training data extracted from polyphonic sounds and applying LDA to them. Although the approach of collecting training data from polyphonic sounds is simple, no previous studies have attempted it. One possible reason may be that a tremendously large amount of data would be required to prepare a thorough training data set containing all possible sound combinations. From our experiments, however, we found that a thorough training data set is not necessary and that a data set extracted from a few musical pieces is sufficient to improve the robustness of instrument identification in polyphonic music. Furthermore, we improved the performance of instrument identification by using musical context. Our method makes it possible to avoid musically unnatural errors by taking the temporal continuity of melodies into consideration.
Because the F0 and onset time of each note were given in our experiments to check the performance of only the instrument identification, we plan to complete MusicXML annotation by integrating our method with a musical note estimation method. Our future work will also include using the descriptions of musical instrument names identified by our method to build a music information retrieval system that enables users to search for polyphonic musical pieces with queries that include musical instrument names.

Figure 1: Overview of the process of constructing a mixed-sound template.

Figure 2: Example of musically unnatural errors. This example is excerpted from the results of identifying each note individually in a piece of trio music. The marked notes are musically unnatural errors, which can be avoided by using musical context. PF, VN, CL, and FL represent piano, violin, clarinet, and flute.

Figure 3: Key idea for using musical context. To calculate the a posteriori probability of note n_k, the a posteriori probabilities of the temporally neighboring notes of n_k are used.

Figure 4: Example of the judgment of whether notes are played on the same instrument. Each tuple (a, b) represents s_h(n_k) = a and s_l(n_k) = b. The figure marks both a pair of notes that is correctly judged to be played on the same instrument and a pair that is not judged so although it actually is.

Figure 5: Flow of our instrument identification method.

Table 1: List of symbols. n_1, ..., n_K: notes contained in a given signal; x_k: feature vector for note n_k; ω_1, ..., ω_m: target instruments.

Table 2: Overview of features (excerpt; some rows were lost in extraction): ratio of power of the fundamental to the ith components (i = 2, 3, ..., 9); 11: relative power in odd and even components; 12-20: number of components whose durations are p% longer than the longest duration (p = 10, 20, ..., 90); *: average differential of power envelope during the t-second interval from the onset time (t = 0.15, 0.20, 0.25, ..., 0.55 s); 31-39*: ratio of power at t seconds after the onset time. (*In Template Types I and II, some of these features have been excluded due to the limitations of the note durations.)

Table 3: Audio data on solo instruments.

Table 4: Instrument candidates for each part. The abbreviations of instruments are defined in Table 3.

Table 5: Number of notes in mixed-sound templates (Type I). Templates of Types II and III have about 1/2 and 1/3-1/4 times the notes of Type I (details are omitted due to a lack of space). S + D and S + D + T stand for the templates constructed from audio signals of solo and duo music, and from those of solo, duo, and trio music, respectively. *: template used in Experiment 3.

Table 6: Results of Experiment 1. ✓: used, ×: not used; bold font denotes recognition rates higher than 75%. (McNemar's test is usable for testing whether the proportions of A-labeled ("correct" in this case) data to B-labeled ("incorrect") data under two different conditions are significantly different. Because the numbers of notes differ among instruments, we sampled 100 notes at random for each instrument to avoid bias. The results of McNemar's test for the quartet music are listed in Table 7.)

Table 8: Template construction from only one piece (Experiment 2). Quartet only due to lack of space (unit: %); leave-one-out. Numbers in the left column denote the piece numbers used for testing; those in the top row denote the piece numbers used for template construction.

Table 10: Comparison of templates whose instrument combinations were reduced (subset) and not reduced (full set).

Table 11: Results of McNemar's test for full-set and subset templates.