EURASIP Journal on Applied Signal Processing 2002:4, 372–378
© 2002 Hindawi Publishing Corporation
Audio Classification in Speech and Music: A Comparison Between a Statistical and a Neural Approach

We address the problem of classifying audio into speech and music for multimedia applications. In particular, we present a comparison between two different techniques for speech/music discrimination. The first method is based on the zero crossing rate and Bayesian classification. It is very simple from a computational point of view, and gives good results for pure music or pure speech. The simulation results show, however, that performance degrades when the music segment also contains speech superimposed on music, or strong rhythmic components. To overcome these problems, we propose a second method that uses more features and is based on neural networks (specifically a multi-layer perceptron). In this case we obtain better performance, at the expense of a limited growth in computational complexity. In practice, the proposed neural network is simple to implement if a suitable polynomial is used as the activation function, and a real-time implementation is possible even on low-cost embedded systems.


INTRODUCTION
Effective navigation through multimedia documents is necessary to enable widespread use and access to richer and novel information sources.
Design of efficient indexing techniques to retrieve relevant information is another important requirement. Allowing for possible automatic procedures to semantically index audio-video material represents therefore a very important challenge. Such methods should be designed to create indices of the audio-visual material, which characterize the temporal structure of a multimedia document from a semantic point of view.
In October 1996 the International Organization for Standardization (ISO) started a standardization process for the description of the content of multimedia documents, namely MPEG-7: the "Multimedia Content Description Interface" [1,2]. However, the standard specifications do not indicate methods for the automatic selection of indices.
One possible approach is to identify series of consecutive segments that exhibit a certain coherence according to some property of the audio-visual material. By organizing the degree of coherence according to more abstract criteria, it is possible to construct a hierarchical representation of information, so as to create a table-of-contents description of the document. Such a description appears quite adequate for navigating through the multimedia document, thanks to the multi-layered summary that it provides [3,4]. Traditionally, the most common approach to creating an index of an audio-visual document has been based on the automatic detection of the changes of camera records and the types of editing effects involved. This kind of approach has generally demonstrated satisfactory performance and has led to a good low-level temporal characterization of the visual content. However, the semantic level reached remains poor, since the description is very fragmented given the high number of shot transitions occurring in typical audio-visual programs.
Alternatively, there have been recent research efforts to base the analysis of audio-visual documents on joint audio and video processing, so as to provide a higher-level organization of information [5,6,7,8]. In [7,8] these two sources of information have been jointly considered for the identification of simple scenes that compose an audio-visual program. The video analysis associated with cross-modal procedures can be very computationally intensive (by relying, e.g., on identifying correlation between nonconsecutive shots).
We believe that audio information carries by itself a rich level of semantic significance, and this paper focuses on this issue.
In particular, we propose and compare two simple speech/music discrimination schemes for audio segments.
The first approach, based mainly on Zero Crossing Rate (ZCR) and Bayesian classification, is very simple from a computational complexity point of view, and gives good results for pure music or pure speech. Some problems arise when the music segment also contains speech superimposed on music, or strong rhythmic components.
To overcome these problems, we propose an alternative method that uses more features and is based on neural networks (specifically a Multi-Layer Perceptron, MLP). In this case we obtain better performance, at the expense of an increased computational complexity. Nevertheless, the proposed neural network is simple to implement if a suitable polynomial is used as the activation function, and a real-time implementation is possible even on low-cost embedded systems.
The paper is organized as follows. Section 2 is devoted to a brief description of the solutions for speech/music discrimination presented in the literature. The proposed algorithms are described, respectively, in Sections 3 and 4, whereas in Section 5 we report and discuss the experimental results. Some concluding remarks are given in Section 6.

STATE OF THE ART SOLUTIONS
In this section, we focus the attention on the solutions proposed in the literature to the problem of speech/music discrimination.
Saunders [9] proposed a method based on the statistical parameters of the ZCR, plus a measure of the short-time energy contour. Then, using a multivariate Gaussian classifier, he obtained a good percentage of class discrimination. This approach successfully discriminates speech from music in broadcast FM radio programs, thanks to its low computational complexity and to the relative homogeneity of this type of audio signal.
Scheirer and Slaney [10] developed another approach to the same problem, which exploits different features while still achieving similar results. In this case too, the algorithm achieves real-time performance; it uses time-domain features (short-term energy, ZCR) and frequency-domain features (4 Hz modulation energy, spectral rolloff point, centroid and flux, among others), and also extracts their variance over one-second segments. Several classification methods are used (Gaussian mixture model, K-nearest neighbor), and they yield similar results.
Foote [11] adopted a purely data-driven technique and did not extract subjectively "meaningful" acoustic parameters. In his work, the audio signal is first parameterized into Mel-scaled cepstral coefficients plus an energy term, obtaining a 13-dimensional feature vector (12 MFCC plus energy) at a 100 Hz frame rate. Then, using tree-based quantization, the audio is classified into speech, music, and nonvocal sounds.
Saraceno and Leonardi [7], and Zhang and Kuo [12] proposed more sophisticated approaches to achieve a finer decomposition of the audio stream. In both works the audio signal is decomposed into at least four classes: silence, music, speech, and environmental sounds.
In the first work, at the first stage, a silence detector is used, which divides the silence frames from the others with a measure of the short time energy. It considers also their temporal evolution by dynamic updating of the statistical parameters, and by means of a finite state machine, to avoid misclassification errors. Hence, the three remaining classes are divided using autocorrelation measures, local as well as contextual, and the ZCR, obtaining good results, where misclassifications occur mainly at the boundary between segments belonging to different classes.
In [12] the classification is performed at two levels: a coarse and a fine level. For the first level, a morphological and statistical analysis of the energy function, the average ZCR, and the fundamental frequency is used. Then a rule-based heuristic procedure is proposed to classify audio signals based on these features. At the second level, a further classification is performed for each type of sound. Because this finer classification is inherently semantic, a different approach could be used for each class. The results for the coarse level show good accuracy, and misclassification usually occurs in hybrid sounds, which contain more than one basic type of audio.
Liu et al. [13] used another kind of approach, because their aim was to analyze the audio signal for a scene classification of TV programs. The features selected for this task are in both the time and frequency domains, and they are meaningful for scene separation and classification. These features are: no-silence ratio, volume standard deviation, volume dynamic range, frequency component at 4 Hz, pitch standard deviation, voice or music ratio, noise or unvoiced ratio, frequency centroid, bandwidth, and energy in 4 sub-bands of the signal. Feedforward neural networks are used successfully as pattern classifiers in this work. The recognized classes are advertisement, basketball, football, news, and weather forecasts, and the results show the usefulness of audio features for scene classification.
An alternative approach to audio data partitioning is supervised partitioning. The supervision concerns the ability to train the models of the various clusters considered in the partitioning. In the literature, Gaussian Mixture Models (GMM) [14] are frequently used to train the models of the chosen clusters. From a reference segmented and labeled database, the GMMs are trained on acoustic data to model characterized clusters (e.g., speech, music, and background). The great variability of noises (e.g., rumbling, explosion, creaking) and of music (e.g., classic, pop) observed in audio-video databases (e.g., broadcast news, movie films) makes it difficult to select a suitable training strategy for the models of the various clusters characterizing these sounds. The main problem in training the models is the segmentation and labeling of the large audio databases needed for statistical training. As long as automatic partitioning is not perfect, labeling the databases is time consuming for human experts. To avoid this cost and to cover the processing of any audio document, the characterization must be generic, and an adaptation of the data partitioning techniques to the audio signals is required to minimize the training of the various clusters of sounds.
In general, these algorithms suffer some performance degradation when the music segment contains speech superimposed on music, or strong rhythmic components. As previously mentioned, to overcome these problems we propose a method based on neural networks that gives good performance also in these specific cases, at the expense of a limited growth in computational complexity. The performance of this method is then compared to that obtained using a statistical approach based on ZCR and Bayesian classification.

ZCR WITH BAYESIAN CLASSIFIER
As previously mentioned, several researchers assume an audio model composed of four classes: silence, music, speech, and noise.
In this work, we focus the attention on the specific problem of audio classification in music and speech, assuming that the silence segments have already been identified using, for example, the method proposed in [8].
For this purpose, we exploit a characteristic of speech that discriminates it from music: speech shows a very regular structure, whereas music does not. Indeed, speech is composed of a succession of vowels and consonants: while the vowels are high-energy events with most of the spectral energy contained at low frequencies, the consonants are noise-like, with the spectral energy distributed more towards the higher frequencies.
Saunders [9] used the ZCR, which is a good indicator of this behavior, as shown in Figure 1.
In our algorithm, depicted in Figure 2, the audio file is partitioned into segments of 2.04 seconds; each of them is composed of 150 consecutive nonoverlapping frames. These values give a statistically significant number of frames and, using a 22050 Hz sampling frequency, each frame contains 300 samples, which is an adequate tradeoff between the quasi-stationary properties of the signal and a sufficient length to accurately evaluate the ZCR. For every frame, the value of the ZCR is then calculated using the definition given in [9].
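The framing described above can be sketched as follows. This is a minimal illustration only: the exact ZCR definition is the one given in [9], and the sign-change count used here is a common variant adopted as an assumption. The function names are hypothetical.

```python
def frame_zcr(frame):
    """Count sign changes between consecutive samples of one frame.

    Assumption: a zero crossing is a change of sign between neighbours,
    with zero treated as non-negative. The definition in [9] may differ.
    """
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            crossings += 1
    return crossings


def segment_zcr(samples, frame_len=300, frames_per_segment=150):
    """Split one segment into non-overlapping frames; return one ZCR per frame."""
    needed = frame_len * frames_per_segment
    if len(samples) < needed:
        raise ValueError("segment too short")
    return [frame_zcr(samples[i * frame_len:(i + 1) * frame_len])
            for i in range(frames_per_segment)]


# 150 frames of 300 samples at 22050 Hz last about 2.04 seconds.
segment_duration = 150 * 300 / 22050
```

With these defaults, one call to `segment_zcr` yields the 150 ZCR values from which the segment statistics are estimated.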
These 150 values of the ZCR are then used to estimate the following statistical measures:
• variance: which indicates the dispersion with respect to the mean value;
• third-order moment: which indicates the degree of skewness with respect to the mean value;
• difference between the number of ZCR samples which are above and below the mean value.
Each segment of 2.04 seconds is thus associated with a 3-dimensional vector.
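The three measures can be computed from the ZCR values of a segment as in the sketch below. The normalization (division by the number of values) and the function name are assumptions, since the text does not state them.

```python
def zcr_features(zcr_values):
    """Return (variance, third-order central moment, count above mean minus count below)."""
    n = len(zcr_values)
    mean = sum(zcr_values) / n
    # Assumption: population (1/n) normalization for both moments.
    variance = sum((z - mean) ** 2 for z in zcr_values) / n
    third_moment = sum((z - mean) ** 3 for z in zcr_values) / n
    above = sum(1 for z in zcr_values if z > mean)
    below = sum(1 for z in zcr_values if z < mean)
    return (variance, third_moment, above - below)
```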
To achieve the separation between speech and music using a computationally efficient implementation, a multivariate Gaussian classifier has been used. A set of about 400 4-second-long audio samples, equally distributed between speech and music, has been used to characterize the classifier. At the end of this step we obtain a set of consecutive segments labeled as speech or no-speech.
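A Gaussian classifier of this kind can be sketched as follows. To keep the example short it assumes a diagonal covariance matrix, which is a simplification of the full multivariate case; the equal priors, the label strings, and the function names are also assumptions.

```python
import math


def fit_gaussian(vectors):
    """Estimate a per-dimension mean and variance from training vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    # A small floor avoids division by zero for constant dimensions.
    variances = [max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-9)
                 for d in range(dims)]
    return means, variances


def log_likelihood(x, model):
    """Log density of x under a diagonal-covariance Gaussian."""
    means, variances = model
    return sum(-0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
               for xi, mu, var in zip(x, means, variances))


def classify(x, speech_model, music_model, prior_speech=0.5):
    """Bayes decision: pick the class with the larger log posterior."""
    speech_score = log_likelihood(x, speech_model) + math.log(prior_speech)
    music_score = log_likelihood(x, music_model) + math.log(1 - prior_speech)
    return "speech" if speech_score > music_score else "music"
```

Each training vector here would be the 3-dimensional feature vector of one labeled segment.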
The next step is justified by an empirical observation: the probability of observing a single segment of speech surrounded by music segments is very low, and vice versa. Therefore, a simple regularization procedure is applied to properly set the labels of these spurious segments.
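One possible form of this regularization is sketched below. The rule used (relabel an isolated segment when both neighbours agree with each other) is an assumption, as the text does not give the exact procedure.

```python
def regularize(labels):
    """Relabel any single segment whose two neighbours agree with each other but not with it."""
    smoothed = list(labels)
    for i in range(1, len(labels) - 1):
        # Decisions are taken on the original labels, not on already smoothed ones.
        if labels[i - 1] == labels[i + 1] != labels[i]:
            smoothed[i] = labels[i - 1]
    return smoothed
```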
The boundaries between segments of different classes fall at fixed positions, which is inherent in the segment-based nature of the ZCR algorithm. These boundaries are therefore only approximate, and a fine-level analysis of the segments across each boundary is needed to place it precisely. In particular, the ZCR values of the neighboring segments are processed to identify the exact position of the transition between speech and music. A new signal y[n] is obtained from these ZCR values, where x[n] is the nth ZCR value of the current segment and x̄_n is its mean over a short window, so that y[n] is an estimate of the ZCR variance in that window. A low-pass filter is then applied to this signal to obtain a smoother version of it, and finally a peak extractor is used to identify the transition between speech and music.
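The refinement step can be illustrated with the sketch below. The window length, the moving-average low-pass filter, and the use of the global maximum as the peak extractor are all assumptions chosen for illustration; the text does not specify them.

```python
def local_variance(x, half_window=5):
    """Variance of x in a short window centred on each index (truncated at the edges)."""
    y = []
    for n in range(len(x)):
        window = x[max(0, n - half_window):min(len(x), n + half_window + 1)]
        mean = sum(window) / len(window)
        y.append(sum((v - mean) ** 2 for v in window) / len(window))
    return y


def moving_average(y, taps=5):
    """Simple low-pass filter: centred moving average."""
    half = taps // 2
    out = []
    for n in range(len(y)):
        window = y[max(0, n - half):min(len(y), n + half + 1)]
        out.append(sum(window) / len(window))
    return out


def find_transition(zcr_values, half_window=5, taps=5):
    """Index of the largest peak of the smoothed local variance."""
    smooth = moving_average(local_variance(zcr_values, half_window), taps)
    return max(range(len(smooth)), key=smooth.__getitem__)
```

The local variance is largest where the window straddles two regions with different ZCR statistics, which is why its peak marks the transition.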

NEURAL NETWORK CLASSIFIER
The second approach we propose is based on a Multi-Layer Perceptron (MLP) network [15].
The MLP has been trained using five classes of audio traces, assuming that other audio sources, such as silence or noise, have been previously removed. The classes of audio traces considered are: instrumental music without voice, such as Beethoven's Symphony No. 6 (class labeled "Am"); melodic songs, such as "My heart will go on" from Titanic (class labeled "Bm"); rhythmic songs, such as rap music or the Dire Straits song "Sultans of swing" (class labeled "Cm"); pure speech (class labeled "Av"); and speech superimposed on music, as in commercials (class labeled "Bv").
In the literature many features have been suggested for speech/music discrimination; see, for example, [16]. In this work, we have analysed more than 30 features, and eight of them have been selected as the neural network inputs. These parameters have been computed over 86 frames of 1024 points each (sampling frequency f_s = 22050 Hz), with a total observation time of about 4 seconds.
To test the effectiveness of the various features, and to train the MLP, a set of about 400 4-second-long audio samples has been considered, belonging to the five classes labeled Am, Bm, Cm, Av, Bv, and equally distributed between speech (Av, Bv) and music (Am, Bm, Cm). The discrimination power of the selected features has first been evaluated by computing the index α, defined by (3), for each feature P_j, with j = 1 to 8:
$$\alpha = \frac{|\mu_m - \mu_v|}{\sigma_m + \sigma_v}, \qquad (3)$$
where µ_m and σ_m are, respectively, the mean value and standard deviation of parameter P_j for music samples, and µ_v and σ_v are the same for speech. If parameter P_j follows a Gaussian distribution, an α-value equal to 1 yields a statistical classification error of about 15%. α-values between 0.7 and 1 result for the selected features.
A short description of the eight selected features follows. Parameter P_1 is the spectral flux, as suggested in [10]. It indicates how rapidly the frequency spectrum changes, with particular attention to the low frequencies (up to 2.5 kHz), and it generally assumes higher values for speech.
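The relation between the index α and the quoted 15% error can be checked numerically. The sketch assumes that α is the distance between the class means divided by the sum of the two standard deviations, a form consistent with the error figure given for α = 1, and that the two classes have equal variances and equal priors. Function names are hypothetical.

```python
import math


def mean_std(values):
    """Mean and population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    return mu, math.sqrt(sum((v - mu) ** 2 for v in values) / n)


def alpha_index(music_values, speech_values):
    """Assumed form: |mu_m - mu_v| / (sigma_m + sigma_v)."""
    mu_m, sigma_m = mean_std(music_values)
    mu_v, sigma_v = mean_std(speech_values)
    return abs(mu_m - mu_v) / (sigma_m + sigma_v)


def gaussian_error_for_alpha(alpha):
    """Error of a midpoint threshold between two equal-variance, equal-prior Gaussians.

    The threshold lies alpha standard deviations from each mean,
    so the error is the Gaussian tail probability Q(alpha).
    """
    return 0.5 * math.erfc(alpha / math.sqrt(2))
```

Under these assumptions, `gaussian_error_for_alpha(1.0)` evaluates to roughly 0.159, in line with the figure of about 15% stated in the text.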
Parameters P_2 and P_3 are related to the short-time energy [17]. Function E(n), with n = 1 to 86, is computed as the sum of the squared values of the previous 1024 signal samples. A fourth-order high-pass Chebyshev filter is applied, with a cutoff frequency of about 100 Hz. Parameter P_2 is computed as the standard deviation of the absolute value of the resulting signal, and it is generally higher in speech. Parameter P_3 is the minimum of the short-time energy and it is generally lower in speech, due to the pauses that occur among words or syllables.
Parameters P_4 and P_5 are related to the cepstrum coefficients, evaluated using
$$c(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\left|X\left(e^{j\omega}\right)\right| e^{j\omega n}\, d\omega.$$
Cepstrum coefficients c_j(n), suggested in [18] as good speech detectors, have been computed for each frame; then the mean value c_µ(n) and the standard deviation c_σ(n) have been calculated, and parameters P_4 and P_5 are obtained from them, with P_4 = c_µ(9) · c_µ(11) · c_µ(13).
Parameter P_6 is related to the centroid, which is computed from the spectrum magnitude of each frame. It is the product of the mean value and the standard deviation of the 86 centroid values. Indeed, because of the discontinuity of speech, including the standard deviation makes this parameter more distinctive.
Parameter P_7 is related to the ratio of the high-frequency power spectrum (7.5 kHz < f < 11 kHz) to the whole power spectrum. The speech spectrum is usually considered up to 4 kHz, but the lower limit has been increased to account for signals with speech over music. To take the speech discontinuity into account and increase the discrimination between speech and music, P_7 is the ratio of the mean value to the standard deviation of the 86 values of the relative high-frequency power spectrum.
Parameter P_8 is the syllabic frequency [10], computed from the short-time energy calculated on 256 samples (≈ 12 ms) instead of 1024. This signal is filtered with a 5-tap median filter, and the syllabic frequency (P_8) is the number of peaks detected in 4 seconds. According to [10], music should present a greater number of peaks.
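The computation of the syllabic frequency can be sketched as below. The non-overlapping energy windows, the edge handling of the median filter, and the peak definition (a strict local maximum after collapsing flat runs) are assumptions, since the text only names the steps.

```python
import statistics


def short_time_energy(samples, window=256):
    """Sum of squared samples over consecutive non-overlapping windows."""
    return [sum(s * s for s in samples[i:i + window])
            for i in range(0, len(samples) - window + 1, window)]


def median_filter(values, taps=5):
    """Centred median filter; the window is truncated at the edges."""
    half = taps // 2
    return [statistics.median(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]


def count_peaks(values):
    """Count local maxima, treating a flat top as a single peak."""
    distinct = [v for i, v in enumerate(values) if i == 0 or v != values[i - 1]]
    return sum(1 for i in range(1, len(distinct) - 1)
               if distinct[i - 1] < distinct[i] > distinct[i + 1])


def syllabic_frequency(samples, window=256, taps=5):
    """Number of energy peaks in the analysed excerpt (about 4 seconds in the text)."""
    return count_peaks(median_filter(short_time_energy(samples, window), taps))
```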
The proposed MLP has eight inputs, corresponding to the normalized features P_1 to P_8, fifteen hidden neurons, and five output neurons, corresponding to the five considered classes, and it uses a normalized sigmoid activation function.
The 400 audio samples, which have also been used for the ZCR with Bayesian classifier, have been divided into three sets: training (200 samples), validation (100 samples), and test (100 samples). Each sample is formatted as {P_1, ..., P_8, P_Av, P_Bv, P_Am, P_Bm, P_Cm}, where P_Av is the probability that the sample belongs to class Av.
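The 8-15-5 topology can be illustrated with the forward pass below. The logistic sigmoid stands in for the "normalized sigmoid" of the text, the random weights are placeholders rather than trained values, and the mapping from the strongest output to a speech/music label is an assumption about the decision step.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid (assumed form of the activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer followed by the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(features, w_hidden, b_hidden, w_out, b_out):
    """8 inputs -> 15 hidden neurons -> 5 outputs."""
    return layer(layer(features, w_hidden, b_hidden), w_out, b_out)


CLASSES = ["Av", "Bv", "Am", "Bm", "Cm"]  # output order is an assumption


def speech_or_music(outputs):
    """Map the strongest output to the binary speech/music label."""
    best = CLASSES[max(range(len(outputs)), key=outputs.__getitem__)]
    return "speech" if best in ("Av", "Bv") else "music"


# Placeholder weights, for shape checking only.
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(15)]
b_hidden = [0.0] * 15
w_out = [[rng.uniform(-1, 1) for _ in range(15)] for _ in range(5)]
b_out = [0.0] * 5
outputs = mlp_forward([0.5] * 8, w_hidden, b_hidden, w_out, b_out)
```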
The goal is to distinguish between speech and music, not to identify the class; for the latter purpose a different and more complex set of parameters should be designed. To perform the proposed binary classification, the target has been set to "1" for the selected class, "0" for the farthest class, a value between 0.8 and 0.9 for the similar classes, and a value between 0.1 and 0.2 for the other classes. For instance, if a sample of Bm (melodic songs) is considered, P_Bm = 1, P_Am = P_Cm = 0.8 because music is dominant, P_Bv = 0.2 because it is anyway a mix of music and voice, and P_Av = 0.1, because the selected sample contains voice.
If a pure music sample is considered (class Am), P_Am = 1, P_Bm = P_Cm = 0.8 because it is a mix of music and voice where music is dominant, P_Bv = 0.1 because it contains music, and P_Av = 0, because pure speech is the farthest class. In fact, classifying speech over music as speech inclines the MLP to classify some rhythmic songs as speech: by adjusting the sample targets it is possible to bias the MLP response towards one side or the other.
The MLP has been trained using the Levenberg-Marquardt method [19] with a starting value of µ equal to 1000 (slow and accurate behavior). The decision algorithm is depicted in Figure 3.
The mean square error related to the 400 samples was about 4%. It should be noted that most of the music samples wrongly classified as speech belonged to class Cm, that is, rhythmic songs such as rap music.
The selected features are rather simple to compute even on a low-cost device (DSP, microcontroller), except for parameters P_4 and P_5, related to the cepstrum coefficients. If P_4 and P_5 are neglected, and a 6-input MLP is used, the mean square error related to the 400 samples increases to about 5%. The neural network is simple to implement if a suitable polynomial is used as the activation function [20], and a real-time implementation is possible even on low-cost embedded systems.
Output y is updated every 4 seconds, which could limit the ability to locate class changes precisely. To increase the output update rate, a circular frame buffer has been provided, and the features P_j, in terms of mean value and standard deviation, are updated every 186 ms, corresponding to 4 frames f_i, as shown in Figure 4.
The new update rate has been chosen as the fastest that can be implemented on a low-cost DSP (TMS320C31). In addition, this operation allows low-pass filters to be applied to the MLP output before the maximum value is computed.
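The circular frame buffer can be sketched as below for a single per-frame quantity. The class name and interface are hypothetical, and the sketch recomputes the statistics from scratch at each update, whereas an embedded implementation would more likely update them incrementally.

```python
from collections import deque
import math


class FeatureBuffer:
    """Keeps the last `capacity` per-frame values and refreshes statistics every `hop` frames."""

    def __init__(self, capacity=86, hop=4):
        self.frames = deque(maxlen=capacity)  # oldest frame is dropped automatically
        self.hop = hop
        self.pending = 0

    def push(self, frame_value):
        """Add one value; return (mean, std) when an update is due, else None."""
        self.frames.append(frame_value)
        self.pending += 1
        if len(self.frames) < self.frames.maxlen or self.pending < self.hop:
            return None
        self.pending = 0
        n = len(self.frames)
        mean = sum(self.frames) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in self.frames) / n)
        return mean, std


# 4 frames of 1024 samples at 22050 Hz correspond to about 186 ms.
update_period = 4 * 1024 / 22050
```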

SIMULATION RESULTS
The proposed algorithms have been tested by computer simulations to estimate the classification performance. The tests carried out can be divided into two categories: the first one is about the misclassification errors, while the second one is about the precision in music-speech and speech-music change detection.
Considering the misclassification errors, we defined three parameters as follows:
• MER (Music Error Rate): it represents the ratio between the total duration of the music segments misclassified, and the total duration of the music test file.
• SER (Speech Error Rate): it represents the ratio between the total duration of the speech segments misclassified, and the total duration of the speech test file.
• TER (Total Error Rate): it represents the ratio between the total duration of the segments misclassified in the wrong category (both music and speech), and the total duration of the test file.
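The three rates can be computed from a list of labeled segments as in the sketch below. The input format (duration, true label, predicted label) and the label strings are assumptions made for illustration.

```python
def error_rates(segments):
    """Return (MER, SER, TER) from (duration_seconds, true_label, predicted_label) tuples."""
    total = {"music": 0.0, "speech": 0.0}
    wrong = {"music": 0.0, "speech": 0.0}
    for duration, truth, predicted in segments:
        total[truth] += duration
        if predicted != truth:
            wrong[truth] += duration
    mer = wrong["music"] / total["music"]
    ser = wrong["speech"] / total["speech"]
    ter = (wrong["music"] + wrong["speech"]) / (total["music"] + total["speech"])
    return mer, ser, ter
```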
The selection of the test files was carried out "manually," that is, each file is composed of many pieces of different types of audio (different speakers over different environmental noise, different kinds of music such as classical, pop, rap, funky, etc.) concatenated so as to have a five-minute segment of speech followed by a five-minute segment of music, and so on, for a total duration of 30 minutes.
All the content of this file has been recorded from an FM radio receiver, and it has been sampled at a frequency of 22050 Hz, with a 16-bit uniform quantization.
The classification results for both the proposed methods are shown in Table 1. From the analysis of the simulation results, we can see that the MLP method gives better results than the ZCR with Bayesian classifier (ZB) method, with a lower error rate for both music and speech.
Moreover, both methods perform worst in the classification of the music segments, that is, more segments of music are classified as speech than vice versa. For a better understanding of these results, see Figure 5.
In the figure, the white intervals represent the segments classified as speech, whereas the black ones show the segments classified as music. From this figure, it appears clearly that the worst classification results are obtained in the third music segment, between minutes 20 and 25. The explanation is that these pieces of music contain strong voiced components over a weak music component (e.g., rap and funky). The neural network makes some mistakes only with the rap song (minutes 23-24 in Figure 5), while the ZB approach misclassifies the funky song (minutes 20-23) too. Commercials, which include speech with music in the background, are present in the test file at minutes 17-18: in this case the ZB approach shows only some uncertainties.
The problem related to music identification is due mainly to the following reasons:
• The MLP has been trained to recognize also music with a voiced component, and it fails only if the voiced component is too rhythmic (e.g., the rap song in our case). On the other hand, the Bayesian classifier used in the ZB approach does not take into account cases with mixed components (music and voice), and therefore in this case the classification results are significantly affected by the relative strength of the spurious components.
• Furthermore, the ZB approach, which uses very few parameters, is inherently unable to discriminate between pure speech and speech with music background, while the MLP network, which uses more features, is able to do so.
Considering the precision of music-speech and speech-music change detection, we measured the distance between the correct point on the time scale at which a change occurred and the nearest change point automatically extracted by the proposed algorithms. In particular, we have measured the maximum, minimum, and mean interval between the real change and the extracted one. The results are shown in Table 3(b), where PS2M (Precision Speech to Music) is the error in speech to music change detection, and PM2S (Precision Music to Speech) is the error in music to speech change detection.
Also in this case, the MLP obtains better performance than the ZB.
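The change-detection measurement can be sketched as follows for one direction of change (for example, speech to music). Times are in seconds, and the function name is hypothetical.

```python
def change_detection_errors(true_changes, detected_changes):
    """Return (max, min, mean) distance from each true change to the nearest detected one."""
    errors = [min(abs(t - d) for d in detected_changes) for t in true_changes]
    return max(errors), min(errors), sum(errors) / len(errors)
```

Applying the function separately to the speech-to-music and music-to-speech change points gives the two sets of figures corresponding to PS2M and PM2S.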

CONCLUSION
In this paper, we have proposed and compared two different algorithms for audio classification into speech and music. The first method is based mainly on ZCR and Bayesian classification (ZB). It is very simple from a computational point of view and gives good results for pure music or pure speech. However, some performance degradation arises when the music segment also contains speech superimposed on music, or strong rhythmic components. We have therefore proposed a second method based on a Multi-Layer Perceptron. In this case we obtain better performance, at the expense of a limited growth in computational complexity. In practice, a real-time implementation is possible even on low-cost embedded systems.