Music Genre Classification Using MIDI and Audio Features

We report our findings on using MIDI files and audio features derived from MIDI, separately and in combination, for MIDI music genre classification. We use McKay's dataset of 3 root and 9 leaf genres. Although MIDI alone and audio from MIDI alone achieve much lower accuracy than the results reported by McKay, combining these classifiers improves accuracy. Successful classifier combination requires diversity of the base classifiers. We achieve diversity by using only a certain number of seconds of each MIDI file, different sample rates and sizes for the audio files, and different classification algorithms.


Introduction
The growth of musical databases on the internet and in multimedia systems has brought a great demand for Music Information Retrieval (MIR) applications, and especially for automatic analysis of musical databases. Most current databases are indexed by song title or artist name, where improper indexing can result in incorrect search results. More effective systems extract important features from the audio and classify it into a genre based on these features. Such music retrieval systems should also be able to find similar songs based on the extracted features. However, there are no strict boundaries between audio genres, and no complete agreement exists on their definitions [1,2].
Generally, an audio signal can be represented in two ways on computers. The first is a symbolic representation based on musical scores. Examples of this representation are MIDI and Humdrum, where for each note the pitch, duration (start time/end time) and strength are stored in the file. The second is based on acoustic signals, recording the audio intensity as a function of time at a certain sampling frequency, in either a compressed or an uncompressed format [3]. Note that because of the difference between the symbolic and acoustic representations, the music retrieval algorithms that deal with these data are very different.
Previous work on genre classification uses either symbolic representations of music, such as MIDI files [2,4,5], or audio, such as wav or mp3 files [6,7]. Most of the proposed methods have two processing steps. The first is a frame based feature extraction step for acoustic data, where feature vectors of low-level descriptors are computed from each frame. The most common features used for genre classification of acoustic data are timbre related, rhythm related and pitch related features [8]. Timbre related features include FFT coefficients, Cepstrum and Mel Frequency Cepstrum Coefficients (MFCC), Linear Prediction (LP) coefficients, MPEG filterbank components, Spectral Centroid, Spectral Flux, Zero Crossing Rate, Spectral Roll-Off, low order statistics and delta coefficients [6]. More detailed descriptions of the features can be found in [6,8,9]. In the second step, pattern recognition algorithms are applied to the feature vectors to infer the genre [6].
The MIDI format was developed as a standard to play music on instruments or computers. The sound quality of a MIDI piece depends on the synthesizer (sound card), and MIDI has other limitations, such as not being able to store voice. On the other hand, the format takes much less space, hence it is much easier to store and transmit, is widely accepted, and allows for better comparison between music pieces played on different instruments. The music retrieval problem for symbolic data generally becomes similar to a string matching problem and can be solved by several text searching algorithms [3].
Recently, Vitanyi and his colleagues [10,11] suggested using an approximation of the Kolmogorov distance between two musical pieces as a means to compute clusters of music. They first process the MIDI representation of a music piece to turn it into a string over a finite alphabet. They then compute the distance between two music pieces using their Normalized Compression Distance (NCD). NCD uses the compressed length of a string as an approximation to its Kolmogorov Complexity. Although the Kolmogorov Complexity of a string is not computable, the compressed length approximation has given good results for musical genre and composer clustering and on other data sets [12]. McKay [4] reported very good root (98%) and leaf (90%) genre classification accuracy on his dataset of 225 MIDI music pieces in 3 root and 9 leaf genres. We use the same dataset in our experiments. We first train classifiers for MIDI genre classification. We then produce audio files from the MIDI files and use the audio to determine the genres. Finally, we combine the MIDI and audio classifiers to achieve better accuracy.

We use our pre-processing method [13] on the MIDI files, compute the NCD distance between them using the CompLearn software [www.complearn.org], and then use k-Nearest Neighbour classifiers to predict the root and leaf genre of the MIDI files. In order to achieve classifier diversity, we train four different MIDI classifiers, using only the first 30 seconds, 60 seconds or 120 seconds of the pieces, and also using the whole piece.

Classifiers
Many classification techniques have been used for genre classification, for example Gaussian mixture models [8], support vector machines [14,15], radial basis functions [16], linear discriminant analysis [15] and k-nearest neighbors [15]. In this study, we report our experiments with Linear Discriminant Classifiers (LDC), which assume normal densities, and k-nearest neighbor classifiers (KNN). We also experimented with Quadratic Discriminant Classifiers (QDC), Fisher Linear Discriminant (Fisher), Naïve Bayes Classifiers (NBC) and Parzen Density Based Classifiers (PDC). However, since LDC and KNN performed as well as these alternatives and are simpler, we report only our experiments with LDC and KNN. We give brief descriptions of the LDC and KNN classifiers below and refer the reader to [17] for more information.

Linear Discriminant Classifier (LDC):
The objective of linear discriminant analysis is to find sets of hyperplanes separating the classes. LDC is a linear classifier that assumes normal densities with equal covariance matrices. Fisher's LDA performs dimensionality reduction while preserving the class discriminatory information.

k-Nearest Neighbor (KNN): KNN is a well known nonparametric classifier. The training data are stored with their labels, and a new input x is classified according to the labels of its k closest neighbors (under a distance metric) in the training set. The value of k controls the complexity of the classifier. In our experiments we use k=10 (10NN).
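The KNN decision rule described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the classifier software used in the paper; it assumes distances to the training set are already available (as they are later, when NCD supplies a precomputed distance matrix).

```python
import numpy as np

def knn_predict(dist_row, train_labels, k=10):
    """Predict a label from the k nearest training points.

    dist_row: distances from the test point to every training point
    (e.g. one row of a precomputed NCD distance matrix).
    """
    nearest = np.argsort(dist_row)[:k]          # indices of the k closest points
    votes = [train_labels[i] for i in nearest]  # labels of those neighbours
    # majority vote among the k neighbours
    return max(set(votes), key=votes.count)

# toy example: 5 training pieces, 2 genres, k=3
dists = np.array([0.1, 0.9, 0.2, 0.8, 0.15])
labels = ["Classic", "Jazz", "Classic", "Jazz", "Classic"]
print(knn_predict(dists, labels, k=3))  # → Classic
```

With k=10, as in the paper's 10NN classifier, the same function applies unchanged; only the number of neighbours consulted grows.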

Genre Classification Using Audio Features
Several feature extraction methods, including low-level parameters such as zero-crossing rate, signal bandwidth, spectral centroid, root mean square level, band energy ratio, delta spectrum, psychoacoustic features, MFCC and auditory filterbank temporal envelopes, have been employed for audio classification [8]. We obtained the following content based audio features using Tzanetakis's Marsyas software.

Timbral Features
Timbral features are generally used for music-speech discrimination and speech recognition. They differentiate mixtures of sounds with the same or similar rhythmic content. To extract the timbral features, the audio signal is divided into small intervals over which it can be considered stationary. The following timbral features are calculated for these intervals:
• Spectral Centroid: Measures spectral brightness; defined as the center of gravity of the magnitude spectrum of the STFT.
• Spectral Rolloff: Measures spectral shape; defined as the frequency below which 85% of the magnitude distribution lies.
• Spectral Flux: Measures the amount of local spectral change; defined as the squared difference between the normalized magnitudes of successive spectral distributions.
• Time Domain Zero Crossings: Measures the noisiness of the signal; defined as the number of time domain zero crossings of the signal.
• Low Energy: Measures the amplitude distribution of the signal; defined as the percentage of frames whose RMS energy is less than the average RMS energy over the whole signal.
• Mel-Frequency Cepstral Coefficients (MFCC): MFCCs are well known for speech representation. They are calculated by taking the log-amplitude of the magnitude spectrum and then smoothing the grouped FFT bins according to the perceptually motivated Mel-frequency scaling.
The means and variances of the spectral centroid, spectral rolloff, spectral flux, zero crossings and low energy [8] form a 9 dimensional feature vector, labeled STFT in the experimental results. The means and variances of the first five MFCC coefficients yield a 10 dimensional feature vector, labeled MFCC in the experiments.
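Three of the frame-level features above can be written out directly from their definitions. The sketch below is a minimal NumPy illustration of the spectral centroid, spectral rolloff (at the 85% point) and zero-crossing count for a single frame; the paper itself obtains these features from Marsyas, not from this code. The frame size and the test frequency here are arbitrary choices for the example (the sine is placed exactly on an FFT bin so its centroid and rolloff land near the sine's frequency).

```python
import numpy as np

def timbral_features(frame, sr, rolloff_pct=0.85):
    """Spectral centroid, spectral rolloff and zero crossings of one frame."""
    mag = np.abs(np.fft.rfft(frame))                  # magnitude spectrum (one STFT frame)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    centroid = np.sum(freqs * mag) / np.sum(mag)      # center of gravity of the spectrum
    cum = np.cumsum(mag)
    rolloff = freqs[np.searchsorted(cum, rolloff_pct * cum[-1])]  # 85% point
    zcr = np.sum(np.abs(np.diff(np.sign(frame)))) / 2  # count of sign changes
    return centroid, rolloff, zcr

sr, n = 22050, 512
f = 10 * sr / n                    # ≈ 430.7 Hz, exactly on FFT bin 10
t = np.arange(n) / sr
c, r, z = timbral_features(np.sin(2 * np.pi * f * t), sr)
# centroid and rolloff both land near f; zcr near 2 crossings per cycle
```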

Rhythmic Content Features
Rhythmic content features characterize the movement of music signals over time and contain information such as the regularity of the rhythm, the beat, the tempo, and the time signature [8,18]. The feature set representing rhythm structure is based on detecting the most salient periodicities of the signal. Rhythmic content features are calculated from a beat histogram and yield a 6 dimensional feature vector, labeled BEAT in the experimental results.

Pitch Content Features
Melody and harmony information about the music signal is obtained by pitch detection techniques. Although musical genres by no means can be fully characterized by their pitch content, there are certain tendencies that can lead to useful feature vectors [8]. Pitch content features are calculated from a pitch histogram and yield a 5 dimensional feature vector, labeled MPITCH in the experimental results.

Effect of Sample Rate and Size on Genre Classification
When an audio file is compressed under different settings, its features can change. To understand this effect, we used different sample rates (22.05 kHz, 32 kHz, 44.1 kHz) and sample sizes (8 bit, 16 bit) to convert the audio files to wav format. As seen in Figure 1, we examined the normalized mean difference between the features, computed over all data points, for one setting versus another. There is some variability in all the features, although features 6 (BEAT), 7, 8 and 10 (STFT) seem to vary more than the others.
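The comparison above can be sketched as follows. Note that the paper does not spell out its exact normalization, so the scaling by each feature's mean magnitude below is one plausible formulation, assumed for illustration; the array shapes are also assumptions.

```python
import numpy as np

def normalized_mean_difference(feats_a, feats_b):
    """Mean per-feature difference between two compression settings.

    feats_a, feats_b: (n_pieces, n_features) arrays holding the same
    features extracted from audio rendered under two different settings.
    """
    diff = np.abs(feats_a - feats_b).mean(axis=0)   # mean |difference| per feature
    scale = np.abs(feats_a).mean(axis=0) + 1e-12    # guard against divide-by-zero
    return diff / scale                             # one normalized value per feature

# synthetic stand-in: 225 pieces, 30 features, small perturbation between settings
rng = np.random.default_rng(0)
a = rng.normal(size=(225, 30))
b = a + rng.normal(scale=0.1, size=(225, 30))
nmd = normalized_mean_difference(a, b)   # small values: settings barely differ
```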

Genre Classification Using MIDI and NCD
One way to measure the distance between two music pieces is to first extract features and then measure the distance between the feature vectors. For example, McKay [4] uses twenty features of musical information such as orchestration, number of instruments, adjacent fifths, etc. Once distances are available, a classification algorithm, such as k-Nearest Neighbor, can be used to predict the genre of a music piece.
In this study, in order to measure the distance between two music pieces, we use the Normalized Compression Distance (NCD). According to NCD, two objects are close if the information contained in one helps compress the other; in other words, if two pieces are similar, it is possible to describe one succinctly given the other. The distance is based on the ideal mathematical notion of Kolmogorov complexity, which unfortunately is not effectively computable. However, it is possible to approximate the Kolmogorov complexity using standard compression techniques. NCD uses no background knowledge about music; it is completely general and can, without change, be used in areas as different as linguistic classification and genomics.
Vitanyi and his colleagues [10,11] first process the MIDI representation of a music piece to turn it into a string over a finite alphabet. They then compute the distance between two music pieces x and y using their Normalized Compression Distance (NCD):

NCD(x, y) = (K(xy) - min{K(x), K(y)}) / max{K(x), K(y)}

In this formula K(x) denotes the Kolmogorov Complexity of x, and NCD uses the compressed length of a string as an approximation of its Kolmogorov Complexity. K(xy) is computed simply as the compressed length of x and y concatenated together. This compressed length approximation to Kolmogorov Complexity has given good results for musical genre and composer clustering and other problems in [12].
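The formula above is directly computable once a concrete compressor stands in for K. The sketch below uses Python's zlib for illustration; the paper itself computes NCD with the CompLearn software, which may use a different compressor, so the numeric values here are not the paper's.

```python
import zlib

def C(s: bytes) -> int:
    """Compressed length of s -- a computable stand-in for K(s)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"abcabcabc" * 200          # a highly repetitive "piece"
b_ = b"xyzxyzxyz" * 200         # an unrelated repetitive "piece"
# a string is close to itself (small NCD, though compressor overhead
# keeps it slightly above 0); unrelated strings are much farther apart
print(ncd(a, a), ncd(a, b_))
```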
In this study, we use our preprocessor [13] to turn MIDI files into strings. The preprocessor samples the MIDI file every 5 ms and discovers the notes played simultaneously in each interval, converting each note played in that interval to an integer between 0 and 127. Since all pieces used in the experiments are polyphonic, as in most real-world cases, a polyphonic to monophonic conversion is needed: the note heard as the highest pitch [19] is taken as the representative of the interval. Then the differences between consecutive monophonic notes are computed and written to a binary file. Unlike Vitanyi and his colleagues, we take tempo variations into account by encoding the differences between consecutive monophonic notes; like them, we use NCD as the distance metric between two pieces.
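The preprocessing steps above can be sketched as follows. This is a hypothetical illustration of the described pipeline, not the actual preprocessor from [13]; the function name and the list-of-intervals input format are assumptions made for the example.

```python
def midi_to_string(intervals):
    """Convert sampled MIDI intervals to a note-difference byte string.

    intervals: list of lists; each inner list holds the MIDI note
    numbers (0-127) sounding during one 5 ms interval.
    """
    # polyphonic -> monophonic: keep the highest-pitched note per interval
    mono = [max(notes) for notes in intervals if notes]
    # encode differences between consecutive notes rather than absolute
    # pitches, so transposed versions of a melody map to similar strings
    return bytes((b - a) & 0xFF for a, b in zip(mono, mono[1:]))

# a rising C-major line, with two notes sounding in the middle intervals
print(list(midi_to_string([[60], [60, 64], [64, 67], [67]])))  # → [4, 3, 0]
```

The resulting byte strings are what the compressor sees when computing NCD between two pieces.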
Table 2 shows the root and leaf genre classification accuracy of the 10NN classifier with NCD distances. Distances are computed using the first 30, 60 and 120 seconds of each piece, and finally using the whole piece. The accuracies shown are averaged over 100 different train/test partitions of all the available data. Using the whole piece gives the best root genre classification performance, while using only the first 120 seconds gives the best leaf genre classification performance. Note that, as in the previous section, the root and leaf genre classification performances are well below the results obtained by McKay [4].

Genre Classification Using Both MIDI and Audio from MIDI
We explored the root and leaf genre classification accuracy using MIDI and audio separately, and found that the accuracy varied across feature sets and classifiers. However, the accuracies reached were far below those obtained by McKay. In this section, we investigate whether we can get better results by combining the MIDI and audio classifiers obtained in the previous two sections.
According to Kuncheva [20], in order for classifier combination to be successful, the classifiers need to be diverse. The probability that many independently trained classifiers agree on the same wrong output is small; therefore, majority voting can give the right answer when many independent and diverse classifiers are combined.
There are a number of methods to achieve diverse classifiers: a) Use independent subsamples of data to train each classifier, b) Use different set of features to train each classifier, c) Use different algorithms to train each classifier.
Usually a) and b) are more effective than c) [21].
MIDI distances and audio features give us an initial base of different features. We obtain further diversity by using different initial portions of the MIDI file and different sample rates and sizes for the audio file. The k-nearest neighbor and LDC classifiers also add diversity. As a result, we have a pool of different classifiers whose votes we can combine to achieve better accuracy (Figure 4).
We consider four different flavors of weighted majority voting, described by the weights w_i given to each classifier:
• w_i = 1: This voting scheme gives each classifier the same amount of vote.
• w_i = 2 if 1≤i≤4 or 9≤i≤12, and w_i = 1 if 5≤i≤8: Inspired by the fact that audio-10NN gives the worst results, this method gives less weight to those classifiers.
• w_i proportional to the accuracy of the ith classifier: This method depends on the accuracy of each classifier and is not realizable in practice, because the accuracies are usually not known.
• w_i selected to maximize accuracy: This method exhaustively searches for the w_i in the [0.2:1] interval and reports the w_i that result in the best accuracy.
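The voting schemes above all reduce to the same mechanism: each classifier adds its weight to the score of the label it predicts, and the label with the highest total wins. A minimal sketch, with made-up predictions and weights for illustration:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine classifier outputs by weighted majority voting."""
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w   # each classifier adds its weight to its label
    return max(scores, key=scores.get)

preds = ["Jazz", "Classic", "Jazz", "Pop"]

# equal weights (w_i = 1): plain majority voting
print(weighted_vote(preds, [1, 1, 1, 1]))       # → Jazz

# down-weighting the third classifier flips the decision
print(weighted_vote(preds, [1, 2, 0.5, 1]))     # → Classic
```

The fourth scheme in the list simply runs this function over a grid of weight vectors and keeps the one with the best accuracy.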

Conclusions
In this paper we first classified the genres of MIDI files using the Normalized Compression Distance (NCD) and a 10-nearest neighbor classifier. We then converted the MIDI files to audio and performed genre classification using features extracted at different sample rates and sizes, with LDC and KNN classifiers. Finally, we combined the 12 different classifiers obtained in the previous steps, using different schemes of majority voting. We found that majority voting improved the classification accuracy. Although the classification accuracies for MIDI or audio alone were much below the results obtained by McKay [4], classifier combination improved genre classification. The approach outlined does not require any background musical knowledge, and hence could be used to improve MIDI genre classification accuracy.
Currently, audio to MIDI conversion is not very successful, especially when multiple instruments are used in a piece. We hope that, as the technology improves, a similar approach could be used to improve audio genre classification.

Figure 1: The change of audio features when different sample rates and sample sizes are used.

Figure 2: Root genre classification using the LDC classifier and different sets of features.

Figure 3: Leaf genre classification with the LDC classifier using different sets of features.

Figure 4: A method to combine MIDI and audio features to predict the genre of a MIDI music piece.

Table 1 shows the root and leaf genre classification accuracies obtained using the first and last two compression settings (22.05 kHz or 44.1 kHz sample rate, and 8 or 16 bits). LDC performs better than 10NN for both root and leaf genre classification.

Table 1: Root and leaf genre test classification accuracies on audio data obtained from MIDI, using different compression settings and the 10NN and LDC classifiers.

Table 2: Root and leaf genre test classification accuracies on MIDI data using the NCD distance and the 10-nearest neighbor classifier.

Table 3: Root and leaf genre classification accuracies when the classifiers are combined.

Table 4 shows the confusion matrix entries for each of the base classifiers. The entries are averaged over 100 train/test partitions and normalized to 100 per actual class. Each row corresponds to a classifier with a different feature set and classification method. The second column shows whether MIDI or audio input is used and the type of classifier; it also shows the length of the piece used for MIDI, or the sample rate and sample size for audio. Although the accuracies were similar, the confusion matrices clearly differ for each feature-classifier combination, and this helped the combination achieve better results. Another observation is that Classic is recognized best when 30 seconds of the MIDI file are used, whereas Pop benefits from longer files. While higher quality (i.e., higher sample rate and 16 bit) encoding usually helps Classic and Pop, the same is not true for Jazz.

Table 4: Root genre confusion matrices for the 12 different base classifiers.

Table 5 shows the confusion matrices for the classifier combinations. Using audio and LDC usually gave the best results in Table 4, and Table 5's entries are better than those. Choosing classifier weights according to their accuracies did not improve over equal-weighted majority voting. On the other hand, choosing the optimal weights for the specific set of samples being classified resulted in better performance.

Table 5: Root genre confusion matrices for the four different combinations of base classifiers.