- Research Article
- Open Access
Query-by-Example Music Information Retrieval by Score-Informed Source Separation and Remixing Technologies
© Katsutoshi Itoyama et al. 2010
- Received: 1 March 2010
- Accepted: 31 December 2010
- Published: 17 January 2011
We describe a novel query-by-example (QBE) approach in music information retrieval that allows a user to customize query examples by directly modifying the volume of different instrument parts. The underlying hypothesis of this approach is that the musical mood of retrieved results changes in relation to the volume balance of different instruments. On the basis of this hypothesis, we aim to clarify the relationship between the change in the volume balance of a query and the genre of the retrieved pieces, called genre classification shift. Such an understanding would allow us to instruct users in how to generate alternative queries without finding other appropriate pieces. Our QBE system first separates all instrument parts from the audio signal of a piece with the help of its musical score, and then it allows users to remix these parts to change the acoustic features that represent the musical mood of the piece. Experimental results showed that the genre classification shift was actually caused by the volume change in the vocal, guitar, and drum parts.
- Gaussian Mixture Model
- Audio Signal
- Source Separation
- Nonnegative Matrix Factorization
- Musical Piece
One of the most promising approaches in music information retrieval is query-by-example (QBE) retrieval [1–7], where a user receives a list of musical pieces ranked by their similarity to a musical piece (example) that the user gives as a query. This approach is powerful and useful, but the user has to prepare or find examples of favorite pieces, and it is sometimes difficult to control or change the retrieved pieces after seeing them because another appropriate example must be found and given to get better results. For example, even if a user feels that vocal or drum sounds are too strong in the retrieved pieces, it is difficult to find another piece that has weaker vocal or drum sounds while maintaining the basic mood and timbre of the first piece. Since finding such musical pieces is now a matter of trial and error, we need more direct and convenient methods for QBE. Here we assume that the QBE retrieval system takes audio inputs and treats low-level acoustic features (e.g., Mel-frequency cepstral coefficients and spectral gradient).
We solve this inefficiency by allowing a user to create new query examples for QBE by remixing existing musical pieces, that is, changing the volume balance of the instruments. To obtain the desired retrieved results, the user can easily give alternative queries by changing the volume balance from the piece's original balance. For example, the above problem can be solved by customizing a query example so that the volume of the vocal or drum sounds is decreased. To remix an existing musical piece, we use an original sound source separation method that decomposes the audio signal of a musical piece into different instrument parts on the basis of its musical score. To measure the similarity between the remixed query and each piece in a database, we use the Earth Mover's Distance (EMD) between their Gaussian Mixture Models (GMMs). The GMM for each piece is obtained by modeling the distribution of the original acoustic features, which consist of intensity and timbre.
The underlying hypothesis is that changing the volume balance of different instrument parts in a query increases the diversity of the retrieved pieces. To confirm this hypothesis, we focus on the musical genre since musical diversity and musical genre are related to a certain degree. A music database that consists of pieces of various genres is suitable for this purpose. We define the term genre classification shift as the change of musical genres in the retrieved pieces. We target genres that are mostly defined by the organization and volume balance of musical instruments, such as classical music, jazz, and rock. We exclude genres that are defined by specific rhythm patterns and singing styles, e.g., waltz and hip hop. Note that this does not mean that the genre of the query piece itself can be changed. Based on this hypothesis, our research focuses on clarifying the relationship between the volume change of different instrument parts and the shift in the musical genre of retrieved pieces in order to instruct a user in how to easily generate alternative queries.
To clarify this relationship, we conducted three different experiments. The first experiment examined how much change in the volume of a single instrument part is needed to cause a genre classification shift using our QBE retrieval system. The second experiment examined how the volume change of two instrument parts (a two-instrument combination for volume change) cooperatively affects the shift in genre classification. This relationship is explored by examining the genre distribution of the retrieved pieces. These experimental results show that the desired genre classification shift in the QBE results was easily achieved by simply changing the volume balance of different instruments in the query. The third experiment examined how the source separation performance affects the shift. The pieces retrieved using sounds separated by our method are compared with those retrieved using the original sounds before they were mixed down in the production of the musical pieces. The experimental results showed that the separation performance needed for predictable feature shifts depends on the instrument part.
In this section, we describe our QBE retrieval system for retrieving musical pieces based on the similarity of mood between musical pieces.
2.1. Genre Classification Shift
2.2. Acoustic Feature Extraction
Acoustic features representing musical mood.
Acoustic intensity features: intensity of each subband*.
Acoustic timbre features: spectral peak of each subband*, spectral valley of each subband*, and spectral contrast of each subband*.
2.2.1. Acoustic Intensity Features
The number of subbands is set to 7 in our experiments. Filter banks with ideal frequency responses cannot be constructed as actual filters, so we implemented them by dividing the power spectrogram into subbands and summing the power within each subband.
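As a minimal sketch of this "division and sum" implementation, the intensity of each subband in one frame can be obtained by slicing the power spectrum and summing each slice. The function name and the band edges below are hypothetical; the actual band boundaries are not given in this section.

```python
def subband_intensities(power_frame, band_edges):
    """Sum the power of one spectrogram frame within each subband.

    power_frame: list of nonnegative power values, one per frequency bin.
    band_edges:  increasing bin indices; band i covers
                 [band_edges[i], band_edges[i + 1]).  Hypothetical values.
    """
    return [
        sum(power_frame[lo:hi])
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ]


# Toy frame with 16 bins and 7 subbands whose widths grow with frequency
# (an assumption for illustration only).
frame = [1.0] * 16
edges = [0, 1, 2, 3, 4, 6, 10, 16]
intensities = subband_intensities(frame, edges)
```

Because the bands partition the frame, the subband intensities add up to the total power of the frame.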
2.2.2. Acoustic Timbre Features
2.3. Similarity Calculation
Our QBE retrieval system needs to calculate the similarity between musical pieces, that is, a query example and each piece in a database, on the basis of the overall mood of the piece.
To model the mood of each piece, we use a Gaussian Mixture Model (GMM) that approximates the distribution of acoustic features. We empirically set the number of mixtures to 8; a previous study used a GMM with 16 mixtures, but we used a smaller database than that study for experimental evaluation. Although the dimension of the obtained acoustic features was 33, it was reduced to 9 by principal component analysis, keeping the leading components up to a cumulative eigenvalue ratio of 0.95.
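The dimension-reduction criterion can be sketched as follows: given the eigenvalues of the feature covariance matrix, keep the smallest number of leading components whose cumulative share of the total reaches the threshold. The function name and the eigenvalues are hypothetical and are not taken from the paper's data.

```python
def num_components(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose eigenvalues reach
    the given fraction of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalue spectrum for illustration.
toy_eigenvalues = [5.0, 3.0, 1.6, 0.2, 0.1, 0.1]
k = num_components(toy_eigenvalues)
```

With the toy values above, the first two components cover 80% of the total and the first three cover 96%, so three components are kept.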
To measure the similarity among feature distributions, we used the Earth Mover's Distance (EMD). The EMD is the minimal cost needed to transform one distribution into another.
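To illustrate the "minimal cost" definition, the sketch below computes the EMD in the simplest case: two histograms over the same unit-spaced one-dimensional bins with equal total mass, where the optimal cost reduces to the sum of absolute differences between the running totals. This is a simplification for illustration only. Comparing two GMMs would instead require solving a general transportation problem, for example with the mixture weights as masses and a distance between Gaussian components as the ground distance (an assumption, since the ground distance is not specified here).

```python
def emd_1d(p, q):
    """EMD between two histograms on the same unit-spaced 1-D bins.

    Both histograms must carry the same total mass. The mass carried
    past each bin boundary equals the running difference of the sums.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    cost = 0.0
    carried = 0.0
    for a, b in zip(p, q):
        carried += a - b      # surplus that must move to later bins
        cost += abs(carried)  # moving it one bin costs its mass
    return cost


d_same = emd_1d([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
d_shift = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
```

Identical histograms have distance zero, while moving one unit of mass across two bins costs two.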
As mentioned in Section 1, musical audio signals should be separated into instrument parts beforehand so that the volume of those parts can be boosted or reduced. Although a number of sound source separation methods [11–14] have been studied, most of them still focus on music performed either on pitched instruments that have harmonic sounds or on drums that have inharmonic sounds. For example, most separation methods for harmonic sounds [11–14] cannot separate inharmonic sounds, while most separation methods for inharmonic sounds, such as drums, cannot separate harmonic ones. Sound source separation methods based on the stochastic properties of audio signals, for example, independent component analysis and sparse coding [16–18], handle only particular kinds of audio signals: those recorded with a microphone array or those containing a small number of simultaneously sounding musical notes. These methods therefore cannot separate complex audio signals such as commercial CD recordings. In this section, we describe our sound source separation method, which can separate complex audio signals containing both harmonic and inharmonic sounds.
The input and output of our method are described as follows:
- Input: the power spectrogram of a musical piece and its musical score (standard MIDI file). Standard MIDI files for famous songs are often available thanks to Karaoke applications. We assume that the spectrogram and the score have already been aligned (synchronized) by using another method.
- Output: decomposed spectrograms that correspond to each instrument.
To separate the power spectrogram, we adopt the approximation that the power spectrogram is purely additive. By playing back each track of the standard MIDI file (SMF) on a MIDI sound module, we prepared a sampled sound for each note. We call this a template sound and used it as prior information (and initial values) in the separation. The musical audio signal corresponding to each decomposed power spectrogram is obtained by using the inverse short-time Fourier transform with the phase of the input spectrogram.
In this section, we first define the problem of separating sound sources and the integrated tone model. This model is based on a previous study, and we improved the implementation of the inharmonic models. We then derive an iterative algorithm that consists of two steps: sound source separation and model parameter estimation.
3.1. Integrated Tone Model of Harmonic and Inharmonic Models
Separating the sound sources means decomposing the input power spectrogram, a function of time and frequency, into power spectrograms that each correspond to a single musical note. We assume that the input contains multiple musical instruments and that each instrument performs multiple musical notes.
The harmonic tone model is defined as a constrained two-dimensional Gaussian Mixture Model (GMM), which is a product of two one-dimensional GMMs, one describing the temporal structure and one describing the frequency structure. This model is designed by referring to the HTC source model. Analogously, the inharmonic tone model is defined as a constrained two-dimensional GMM that is a product of two one-dimensional GMMs. The temporal structures of the two tone models share an identical mathematical form, but their frequency structures take different forms. In the previous study, the inharmonic models were implemented in a nonparametric way; we instead implemented them in a parametric way. This change improves the generality of the integrated tone model, for example, for timbre modeling and for extension to Bayesian estimation.
Parameters of integrated tone model.
- Relative amplitude of harmonic and inharmonic tone models
- Amplitude coefficient of temporal power envelope for harmonic tone model
- Relative amplitude of each harmonic component
- Amplitude coefficient of temporal power envelope for inharmonic tone model
- Relative amplitude of each inharmonic component
- Diffusion of temporal power envelope for harmonic tone model
- Diffusion of temporal power envelope for inharmonic tone model
- F0 of harmonic tone model
- Diffusion of harmonic components along frequency axis
- Coefficients that determine the arrangement of the frequency structure of inharmonic model
3.2. Iterative Separation Algorithm
and decomposed spectrograms, that is, separated sounds, on the basis of the parameters of the tone models.
We can prevent the overtraining of the models by gradually increasing α from 0 (i.e., the estimated model should first be close to the template spectrogram) through the iteration of the separation and adaptation (model estimation). The parameter update equations are derived by minimizing the objective function. We experimentally set α to 0.0, 0.25, 0.5, 0.75, and 1.0 in sequence, and 50 iterations are sufficient for parameter convergence with each α value. Note that this modification of the objective function has no direct effect on the calculation of the distribution functions since the modification never changes the relationship between the model and the distribution function in the objective function. For all α values, the optimal distribution functions are calculated from only the models written in (21). Since the model parameters are changed by the modification, the distribution functions are also changed indirectly. The parameter update equations are described in the appendix.
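The schedule described above can be sketched as a nested loop: an outer loop over the five weighting values and an inner loop of 50 separation and adaptation iterations. The callables `separate` and `adapt` are hypothetical placeholders for the two steps; the dummy versions used here only make the sketch runnable and do not implement the actual update equations.

```python
ALPHAS = [0.0, 0.25, 0.5, 0.75, 1.0]  # schedule stated in the text
ITERATIONS_PER_ALPHA = 50             # stated as sufficient for convergence


def run_schedule(params, separate, adapt):
    """Alternate separation and adaptation while the weight increases.

    `separate(params)` and `adapt(params, distribution, alpha)` are
    hypothetical callables standing in for the two steps of the algorithm.
    """
    steps = 0
    for alpha in ALPHAS:
        for _ in range(ITERATIONS_PER_ALPHA):
            distribution = separate(params)
            params = adapt(params, distribution, alpha)
            steps += 1
    return params, steps


# Dummy stand-ins so the sketch runs; they are not the real updates.
final_params, total_steps = run_schedule(
    params=0.0,
    separate=lambda p: p,
    adapt=lambda p, d, alpha: alpha,
)
```

Under this schedule the whole procedure runs 5 × 50 = 250 iterations per piece.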
We obtain an iterative algorithm that consists of two steps: calculating the distribution function while the model parameters are fixed and updating the parameters under the distribution function. This iterative algorithm is equivalent to the Expectation-Maximization (EM) algorithm on the basis of the maximum a posteriori estimation. This fact ensures the local convergence of the model parameter estimation.
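The first step, computing the distribution function with the model parameters fixed, can be sketched as follows. The sketch assumes that the distribution function assigns the observed power in each time-frequency bin to the notes in proportion to their current model values; this proportional form, the function name, and the toy numbers are assumptions made for illustration.

```python
def separate(observed, models, eps=1e-12):
    """Split the observed power in each bin among several tone models.

    observed: list of power values, one per time-frequency bin.
    models:   list of model spectrograms, each aligned with `observed`.
    Returns one decomposed spectrogram per model.
    """
    n_bins = len(observed)
    totals = [sum(m[b] for m in models) for b in range(n_bins)]
    return [
        # eps guards against division by zero where all models are silent.
        [observed[b] * m[b] / (totals[b] + eps) for b in range(n_bins)]
        for m in models
    ]


observed = [4.0, 2.0, 6.0]
models = [[1.0, 1.0, 0.0], [3.0, 1.0, 2.0]]
parts = separate(observed, models)
```

Because the shares in each bin sum to one, the decomposed spectrograms add up to the observed spectrogram, which is consistent with treating the power spectrogram as additive.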
We conducted two experiments to explore the relationship between instrument volume balances and genres. Given the query musical piece in which the volume balance is changed, the genres of the retrieved musical pieces are investigated. Furthermore, we conducted an experiment to explore the influence of the source separation performance on this relationship, by comparing the retrieved musical pieces using clean audio signals before mixing down (original) and separated signals (separated).
Number of musical pieces for each genre.
played in all 10 musical pieces for the query,
played for more than 60% of the duration of each piece.
At http://winnie.kuis.kyoto-u.ac.jp/~itoyama/qbe/, sound examples of remixed signals and retrieved results are available.
4.1. Volume Change of Single Instrument
The results in Figure 6 clearly show that a genre classification shift occurred when the volume of any instrument part was changed. Note that the genre of the retrieved pieces at 0 dB (giving the original queries without any changes) is the same in Figures 6(a), 6(b), and 6(c). Although we used 10 popular songs excerpted from the RWC Music Database: Popular Music as the queries, rock was the genre with the highest similarity at 0 dB because those songs actually have a rock flavor with strong guitar and drum sounds.
By increasing the volume of the vocal from −20 dB, the genre with the highest similarity shifted from rock (−20 to 4 dB) to popular (5 to 9 dB) and to jazz (10 to 20 dB) as shown in Figure 6(a). By changing the volume of the guitar, the genre shifted from rock (−20 to 7 dB) to popular (8 to 20 dB) as shown in Figure 6(b). Although it was commonly observed that the genre shifted from rock to popular in both cases of vocal and guitar, the genre shifted to jazz only in the case of vocal. These results indicate that the vocal and guitar would have different importance in jazz music. By changing the volume of the drums, genres shifted from popular (−20 to −7 dB) to rock (−6 to 4 dB) and to dance (5 to 20 dB) as shown in Figure 6(c). These results indicate a reasonable relationship between the instrument volume balance and the genre classification shift, and this relationship is consistent with typical impressions of musical genres.
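The volume changes in this experiment are expressed in decibels. A minimal sketch of building a remixed query from separated parts is shown below, assuming that each gain is applied as an amplitude factor of 10^(dB/20) to the time-domain samples of a part. The function names, part names, and sample values are hypothetical.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def remix(parts, gains_db):
    """Mix separated parts after scaling each by its gain in dB.

    parts:    dict mapping a part name to a list of samples (equal lengths).
    gains_db: dict mapping a part name to a gain in dB; parts that are
              not listed keep their original volume (0 dB).
    """
    length = len(next(iter(parts.values())))
    mixed = [0.0] * length
    for name, samples in parts.items():
        factor = db_to_amplitude(gains_db.get(name, 0.0))
        for i, sample in enumerate(samples):
            mixed[i] += factor * sample
    return mixed


# Toy signals: boost the vocal by 20 dB and reduce the drums by 20 dB.
parts = {"vocal": [0.1, 0.2], "guitar": [0.3, 0.0], "drums": [0.0, 0.5]}
query = remix(parts, {"vocal": 20.0, "drums": -20.0})
```

A gain of 0 dB leaves a part unchanged, so passing an empty gain dictionary reproduces the sum of the separated parts.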
4.2. Volume Change of Two Instruments (Pair)
Although the basic tendency in the genre classification shifts is similar to that in the single-instrument experiment, classical music, which does not appear as the genre with the highest similarity in Figure 6, appears in Figure 7(b) when the vocal part is boosted and the drum part is reduced. The similarity of rock music decreased when we separately boosted either the guitar or the drums, but it is interesting that rock music keeps the highest similarity if both the guitar and drums are boosted together, as shown in Figure 7(c). This result closely matches the typical impression of rock music, and it suggests promising possibilities for this technique as a tool for customizing the query for QBE retrieval.
4.3. Comparison between Original and Separated Sounds
When the volume of the drums was changed, the EMDs plotted in Figure 8(c) followed similar curves in both the original and separated conditions. On the other hand, when the volume of the guitar was changed, the EMDs plotted in Figure 8(b) showed that the curve for the original condition differs from the curve for the separated condition. This result indicates that the shifts of features in those conditions were different. The average source separation performance of the guitar part was −1.77 dB, which was lower than those of the vocal and drum parts. Noise included in the separated sounds of the guitar part induced this difference. When the volume of the vocal was changed, the plotted EMDs of popular and dance pieces followed similar curves, but the EMDs of jazz pieces followed different curves, although the average source separation performance of the vocal part is the highest among these three instrument parts. This result indicates that the separation performance needed for predictable feature shifts depends on the instrument part.
The aim of this paper is to achieve a QBE approach that can retrieve diverse musical pieces by boosting or reducing the volume of the instruments. To confirm the performance of the QBE approach, evaluation using a music database with wide variations is necessary. A music database that consists of pieces of various genres is suitable for this purpose. We defined the term genre classification shift as the change of musical genres in the retrieved pieces since we focus on the diversity of the retrieved pieces, not on a change in the musical genre of the query example.
Further evidence from subjective experiments is needed to determine whether the QBE retrieval system helps users obtain better search results.
In our experiments, we used only popular musical pieces as query examples. Remixing query examples from genres other than popular music could also shift the genres of the retrieved results.
For source separation, we use the MIDI representation of a musical signal. Mixed and separated musical signals contain variable features: timbre differences arising from the individuality of musical instruments, characteristic performances of instrument players such as vibrato, and environmental factors such as room reverberation and sound effects. These features can be controlled implicitly by changing the volume of musical instruments, and therefore QBE systems can retrieve various musical pieces. Since MIDI representations do not contain these features, the diversity of retrieved musical pieces would decrease and users could not evaluate the mood difference of the pieces if we used only musical signals synthesized from MIDI representations.
In the experiments, we used SMFs that were precisely synchronized with the recordings, with onset timing errors of at most 50 milliseconds. In general, the synchronization between CD recordings and their MIDI representations is not accurate enough for separation. Previous studies on audio-to-MIDI synchronization methods [23, 24] can help address this problem. We experimentally confirmed that onset timing errors under 200 milliseconds do not decrease source separation performance. Another problem is that the proposed separation method needs a complete musical score with melody and accompaniment instruments. A study of a source separation method that uses a MIDI representation of only a specified instrument part will help solve the accompaniment problem.
In this paper, we aimed to analyze and decompose a mixture of harmonic and inharmonic sounds by appending the inharmonic model to the harmonic model. To achieve this, one requirement must be satisfied: a one-to-one basis-source mapping based on a structured and parameterized source model. The HTC source model, on which our integrated model is based, satisfies this requirement. Adaptive harmonic spectral decomposition models a harmonic structure in a different way. Both approaches are suitable for multiple-pitch analysis and have been applied to polyphonic music transcription. On the other hand, nonnegative matrix factorization (NMF) is usually used for separating musical instrument sounds and extracting simple repeating patterns [27, 28], and it only approximates a complex audio mixture because the one-to-one mapping is not guaranteed. Combining lower-order analysis using structured models such as the HTC with higher-order analysis using unconstrained models such as the NMF is a promising route to efficient feature extraction from complex audio mixtures.
We have described how musical genres of retrieved pieces shift by changing the volume of separated instrument parts and explained a QBE retrieval approach on the basis of such genre classification shift. This approach is important because it was not possible for a user to customize the QBE query in the past, which required the user to always find different pieces to obtain different retrieved results. By using the genre classification shift based on our original sound source separation method, it becomes easy and intuitive to customize the QBE query by simply changing the volume of instrument parts. Experimental results confirmed our hypothesis that the musical genre shifts in relation to the volume balance of instruments.
Although the current genre shift depends on only the volume balance, other factors such as rhythm patterns, sound effects, and chord progressions would also be useful for causing the shift if we could control them. In the future, we plan to pursue the promising approach proposed in this paper and develop a better QBE retrieval system that easily reflects the user's intention and preferences.
This research was partially supported by the Ministry of Education, Science, Sports and Culture, a Grant-in-Aid for Scientific Research of Priority Areas, the Primordial Knowledge Model Core of Global COE program, and the JST CrestMuse Project.
- Rauber A, Pampalk E, Merkl D: Using psycho-acoustic models and self-organizing maps to create a hierarchical structuring of music by sound similarity. Proceedings of the International Conference on Music Information Retrieval (ISMIR '02), 2002 71-80.
- Yang CC: The MACSIS acoustic indexing framework for music retrieval: an experimental study. Proceedings of the International Conference on Music Information Retrieval (ISMIR '02), 2002 53-62.
- Allamanche E, Herre J, Hellmuth O, Kastner T, Ertel C: A multiple feature model for musical similarity retrieval. Proceedings of the International Conference on Music Information Retrieval (ISMIR '03), 2003 217-218.
- Feng Y, Zhuang Y, Pan Y: Music information retrieval by detecting mood via computational media aesthetics. Proceedings of the International Conference on Web Intelligence (WI '03), 2003 235-241.
- Thoshkahna B, Ramakrishnan KR: Projektquebex: a query by example system for audio retrieval. Proceedings of the International Conference on Multimedia and Expo (ICME '05), 2005 265-268.
- Vignoli F, Pauws S: A music retrieval system based on user-driven similarity and its evaluation. Proceedings of the International Conference on Music Information Retrieval (ISMIR '05), 2005 272-279.
- Kitahara T, Goto M, Komatani K, Ogata T, Okuno HG: Musical instrument recognizer "instrogram" and its application to music retrieval based on instrumentation similarity. Proceedings of the IEEE International Symposium on Multimedia (ISM '06), 2006 265-274.
- Lu L, Liu D, Zhang HJ: Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(1):5-18.
- Jiang D-N, Lu L, Zhang H-J, Tao J-H, Cai L-H: Music type classification by spectral contrast features. Proceedings of the International Conference on Multimedia and Expo (ICME '02), 2002 113-116.
- Rubner Y, Tomasi C, Guibas LJ: A metric for distributions with applications to image databases. Proceedings of the International Conference on Computer Vision (ICCV '98), 1998 59-66.
- Virtanen T, Klapuri A: Separation of harmonic sounds using linear models for the overtone series. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP '02), 2002 2: 1757-1760.
- Every MR, Szymanski JE: A spectral-filtering approach to music signal separation. Proceedings of the Conference on Digital Audio Effects (DAFx '04), 2004 197-200.
- Woodruff J, Pardo B, Dannenberg R: Remixing stereo music with score-informed source separation. Proceedings of the International Conference on Music Information Retrieval (ISMIR '06), 2006 314-319.
- Viste H, Evangelista G: A method for separation of overlapping partials based on similarity of temporal envelopes in multichannel mixtures. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(3):1051-1061.
- Barry D, Fitzgerald D, Coyle E, Lawlor B: Drum source separation using percussive feature detection and spectral modulation. Proceedings of the Irish Signals and Systems Conference (ISSC '05), 2005 13-17.
- Saruwatari H, Kurita S, Takeda K, Itakura F, Nishikawa T, Shikano K: Blind source separation combining independent component analysis and beamforming. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1135-1146. doi:10.1155/S1110865703305104
- Casey MA, Westner A: Separation of mixed audio sources by independent subspace analysis. Proceedings of the International Computer Music Conference (ICMC '00), 2000 154-161.
- Plumbley MD, Abdallah SA, Bello JP, Davies ME, Monti G, Sandler MB: Automatic music transcription and audio source separation. Cybernetics and Systems 2002, 33(6):603-627. doi:10.1080/01969720290040777
- Itoyama K, Goto M, Komatani K, Ogata T, Okuno HG: Integration and adaptation of harmonic and inharmonic models for separating polyphonic musical signals. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), 2007 57-60.
- Kameoka H, Nishimoto T, Sagayama S: A multipitch analyzer based on harmonic temporal structured clustering. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(3):982-994.
- Goto M, Hashiguchi H, Nishimura T, Oka R: RWC music database: popular, classical, and jazz music databases. Proceedings of the International Conference on Music Information Retrieval (ISMIR '02), 2002 287-288.
- Goto M: AIST annotation for the RWC music database. Proceedings of the International Conference on Music Information Retrieval (ISMIR '06), 2006 359-360.
- Turetsky RJ, Ellis DPW: Groundtruth transcriptions of real music from force-aligned MIDI synthesis. Proceedings of the International Conference on Music Information Retrieval (ISMIR '03), 2003.
- Müller M: Information Retrieval for Music and Motion. Springer, Berlin, Germany; 2007.
- Yasuraoka N, Abe T, Itoyama K, Komatani K, Ogata T, Okuno HG: Changing timbre and phrase in existing musical performances as you like. Proceedings of the ACM International Conference on Multimedia (ACM-MM '09), 2009 203-212.
- Vincent E, Bertin N, Badeau R: Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech and Language Processing 2010, 18(3):528-537.
- Schmidt MN, Mørup M: Nonnegative matrix factor 2-D deconvolution for blind single channel source separation. Proceedings of the International Workshop on Independent Component Analysis and Signal Separation (ICA '06), April 2006 700-707.
- Smaragdis P: Convolutive speech bases and their application to supervised speech separation. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(1):1-12.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.