Particle Filtering Applied to Musical Tempo Tracking

This paper explores the use of particle filters for beat tracking in musical audio examples. The aim is to estimate the time-varying tempo process and to find the time locations of beats, as defined by human perception. Two alternative algorithms are presented: the first performs Rao-Blackwellisation to produce an almost deterministic formulation, while the second models tempo as a Brownian motion process. The algorithms have been tested on a large and varied database of examples and results are comparable with the current state of the art. The deterministic algorithm gives the better performance of the two.


INTRODUCTION
Musical audio analysis has been a growing area for research over the last decade. One of the goals in the area is fully automated transcription of real polyphonic audio signals, though this problem is currently only partially solved. More realistic sub-tasks in the overall problem exist and can be explored with greater success; beat tracking is one of these and has many applications in its own right (automatic accompaniment of solo performances [1], auto-DJs, expressive rhythmic transformations [2], uses in database retrieval [3], metadata generation [4], etc.).
This paper describes an investigation into beat tracking utilising particle filtering algorithms as a framework for sequential stochastic estimation where the state-space under consideration is a complex one and does not permit a closed form solution.
Historically, a number of methods have been used to attempt a solution of the problem, and they can be broadly categorised into a small number of distinct methodologies. The oldest approach is to use oscillating filterbanks and to look for the maximum output; Scheirer [7] typifies this approach, though Large [8] is another example. Autocorrelative methods have also been tried, and Tzanetakis [3] or Foote [9] are examples, though these tend to find only the average tempo and not the phase (as defined in Section 2) of the beat. Multiple hypothesis approaches (e.g., Goto [10] or Dixon [11]) are very similar to more rigorously probabilistic approaches (Laroche [12] or Raphael [13], for instance) in that they all evaluate the likelihood of a hypothesis set; only the framework varies from case to case. Klapuri [14] also presents a method for beat tracking which takes the approach typified by Scheirer [7] and applies a probabilistic tempo smoothness model to the raw output. This is tested on an extensive database and the results are the current state of the art.
More recently, particle filters have been applied to the problem; Morris and Sethares [15] briefly present an algorithm which extracts features from the signal and then uses these feature vectors to perform sequential estimation, though their implementation is not described. Cemgil [16] also uses a particle filtering method in his comprehensive paper applying Monte Carlo methods to the beat tracking of expressively performed MIDI signals. This model will be discussed further later, as it shares some aspects with one of the models described in this paper.
The remainder of the paper is organised as follows: Section 2 introduces tempo tracking; Section 3 covers basic particle filtering theory. Sections 4, 5 and 6 discuss onset detection and the two beat tracking models proposed. Results and discussion are presented in Sections 7 and 8, and conclusions in Section 9.

TEMPO TRACKING AND BEAT PERCEPTION
So what is beat tracking? (A fuller discussion on this topic can be found in [6].) The least jargon-ridden description is that it is the task of finding the pulse defined by a human listener tapping in time to music. However, the terms tempo, beat and rhythm need to be defined. The highest level descriptor is the rhythm; this is the full description of every timing relationship inside a piece of music. Bilmes [17] breaks this down into four subdivisions: the hierarchical metrical structure, which describes the idealised timing relationships between musical events (as they might exist in a musical score, for instance); tempo variations, which link these together in a possibly time-varying flow; timing deviations, which are individual timing discrepancies ("swing" is an example of this); and finally arrhythmic sections. If one ignores the last of these as fundamentally impossible to analyse meaningfully, the task is to estimate the tempo curve (tempo tracking) and idealised event times quantised to a grid of "score locations," given an input set of musical changepoint times.
To represent the tempo curve, a frequency and phase is required such that the phase is zero at beat locations. The metrical structure can then be broken down into a set of levels described by Klapuri [14]: the beat or tactus is the preferred human tapping tempo; the tatum is the shortest commonly occurring interval; and the bar or measure is related to harmonic change and often correlates to the bar line in common score notation of music. It should be noted that the beat often corresponds to the 1/4 note or crotchet in common notation, but this is not always the case: in fast jazz music, the beat is often felt at half this rate; in hymn music, traditional notation often gives the beat two crotchets (i.e., 1/2 note). The moral is that one must be careful about relating perception to musical notation! Figure 1 gives a diagrammatic representation of the beat relationships for a simple example. The beat is subdivided by two to get the tatum and grouped in fours to find the bar. The lowest level shows timing deviations around the fixed metrical grid.
Perception of rhythm by humans has long been an active area of research and there is a large body of literature on the subject. Drake et al. [18] found that humans with no musical training were able to tap along to a musical audio sample "in time with the music," though trained musicians were able to do this more accurately. Many other studies have been undertaken into perception of simple rhythmic patterns (e.g., Povel and Essens [19]) and various models of beat perception have been proposed (e.g., [20,21,22]) from which ideas can be gleaned. However, the models presented in the rest of this paper are not intended as perceptual models or even as perceptually motivated models; they are engineering equivalents of the human perception. Having said that, it is hoped that a successful computer algorithm could help shed light onto potential and as yet unexplained human cognitive processes.

Problem statement
To summarise, the aim of this investigation is to extract the beat from music as defined by the preferred human tapping tempo; to make the computer tap its hypothetical foot along in time to the music. This requires a tempo process to be explicitly estimated in both frequency and phase, a beat lying where phase is zero. In the process of this, detected "notes" in the audio are assigned "score locations," which is equivalent to quantising them to an underlying, idealised metrical grid. We are not interested in real-time implementation nor in causal beat tracking where only data up to the currently considered time is used for estimation.

PARTICLE FILTERING
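Particle filters are a sequential Monte Carlo estimation method which is powerful, versatile and increasingly used in tracking problems. Consider the state space system defined by
$$x_k = f_k(x_{k-1}, e_{k-1}), \tag{1}$$
where $f_k : \mathbb{R}^{n_x} \times \mathbb{R}^{n_e} \to \mathbb{R}^{n_x}$ is a possibly nonlinear function of the previous state and $e_{k-1}$ is an i.i.d. noise process of dimension $n_e$ driving the state. The state is observed through
$$y_k = h_k(x_k, \nu_k), \tag{2}$$
where $h_k : \mathbb{R}^{n_x} \times \mathbb{R}^{n_\nu} \to \mathbb{R}^{n_y}$ is a separate possibly nonlinear transform and $\nu_k$ is a separate i.i.d. noise process of dimension $n_\nu$ describing the observation error. The posterior of interest is given by $p(x_{0:k}|y_{1:k})$, which is represented in particle filters by a set of point estimates or particles $\{x_{0:k}^{(i)}, w_k^{(i)}\}_{i=1}^{N}$, where $\{x_{0:k}^{(i)}, i = 1, \ldots, N\}$ is a set of support points with associated weights given by $\{w_k^{(i)}, i = 1, \ldots, N\}$. The weights are normalised such that $\sum_{i=1}^{N} w_k^{(i)} = 1$. The posterior is then approximated by
$$p(x_{0:k}|y_{1:k}) \approx \sum_{i=1}^{N} w_k^{(i)}\, \delta\bigl(x_{0:k} - x_{0:k}^{(i)}\bigr). \tag{3}$$
As $N \to \infty$, this approximation asymptotically tends to the true posterior. The weights are then selected according to importance sampling: the particles are drawn as $x_{0:k}^{(i)} \sim \pi(x_{0:k}|y_{1:k})$, where $\pi(\cdot)$ is the so-called importance density. The weights are then given by
$$w_k^{(i)} \propto \frac{p\bigl(x_{0:k}^{(i)}|y_{1:k}\bigr)}{\pi\bigl(x_{0:k}^{(i)}|y_{1:k}\bigr)}. \tag{4}$$
If we restrict ourselves to importance functions of the form
$$\pi(x_{0:k}|y_{1:k}) = \pi(x_k|x_{0:k-1}, y_{1:k})\, \pi(x_{0:k-1}|y_{1:k-1}), \tag{5}$$
implying a Markov dependency of order 1, the posterior can be factorised to give
$$p(x_{0:k}|y_{1:k}) \propto p(y_k|x_k)\, p(x_k|x_{k-1})\, p(x_{0:k-1}|y_{1:k-1}), \tag{6}$$
which allows sequential update. The weights can then be proven to be updated [23] according to
$$w_k^{(i)} \propto w_{k-1}^{(i)}\, \frac{p\bigl(y_k|x_k^{(i)}\bigr)\, p\bigl(x_k^{(i)}|x_{k-1}^{(i)}\bigr)}{\pi\bigl(x_k^{(i)}|x_{0:k-1}^{(i)}, y_{1:k}\bigr)} \tag{7}$$
up to a proportionality. Often we are interested in the filtered estimate $p(x_k|y_{1:k})$, which can be approximated by
$$p(x_k|y_{1:k}) \approx \sum_{i=1}^{N} w_k^{(i)}\, \delta\bigl(x_k - x_k^{(i)}\bigr). \tag{8}$$
Particle filters often suffer from degeneracy as all but a small number of weights drop to almost zero, a measure of this being approximated by $N_{\mathrm{eff}} = 1 / \sum_{i=1}^{N} \bigl(w_k^{(i)}\bigr)^2$ [23]. Good choice of the importance density $\pi(x_k|x_{0:k-1}, y_{1:k})$ can delay this and is crucial to general performance. The introduction of a stochastic jitter into the particle set can also help [24]; however, the most common solution is to perform resampling [25], whereby particles with small weights are eliminated and a new sample set $\{x_k^{(i)*}\}_{i=1}^{N}$ is generated by resampling N times from the approximate posterior as given by (8) such that $\Pr\bigl(x_k^{(i)*} = x_k^{(j)}\bigr) = w_k^{(j)}$. The new sample set is then more closely distributed according to the true posterior and the weights should be set to $w_k^{(i)} = 1/N$ to reflect this. Further details on particle filtering can be found in [23,26].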
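The weight update, degeneracy measure and resampling step described above can be sketched as follows. This is a minimal illustration, assuming that the prior is used as the importance density so that the weight update reduces to multiplication by the likelihood. The function names are hypothetical and are not taken from the algorithms of this paper.

```python
import random


def update_weights(weights, likelihoods):
    """Multiply the previous weights by the likelihoods and renormalise.

    Assumes prior importance sampling, so the transition density and
    the importance density cancel in the weight update.
    """
    unnormalised = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("all particles have zero likelihood")
    return [u / total for u in unnormalised]


def effective_sample_size(weights):
    """Approximate degeneracy measure: 1 / sum of squared weights."""
    return 1.0 / sum(w * w for w in weights)


def resample(particles, weights, rng=random):
    """Draw N particles with probability equal to their weights.

    Returns the new particle set and uniform weights 1/N.
    """
    n = len(particles)
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [1.0 / n] * n
```

In use, resampling would typically be triggered only when `effective_sample_size` falls below some fraction of N; that threshold is a design choice and is not specified here.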
A special case is the jump Markov linear system (JMLS) [27], where the state space, $x_{0:k}$, can be broken down into $\{r_{0:k}, z_{0:k}\}$. Here $r_{0:k}$, the jump Markov process, defines a path through a bounded and discrete set of potential states and, conditional upon $r_{0:k}$, $z_{0:k}$ is then defined to be linear Gaussian. The chain rule gives the expansion
$$p(r_{0:k}, z_{0:k}|y_{1:k}) = p(z_{0:k}|r_{0:k}, y_{1:k})\, p(r_{0:k}|y_{1:k}), \tag{9}$$
and $p(z_{0:k}|r_{0:k}, y_{1:k})$ is deterministically evaluated via the Kalman filter equations given below in Section 5. After this marginalisation process (called Rao-Blackwellisation [28]), $p(r_{0:k}|y_{1:k})$ is then expanded as
$$p(r_{0:k}|y_{1:k}) \propto p(y_k|y_{1:k-1}, r_{0:k})\, p(r_k|r_{k-1})\, p(r_{0:k-1}|y_{1:k-1}), \tag{10}$$
with associated (unnormalised) importance weights given by
$$w_k^{(i)} \propto w_{k-1}^{(i)}\, \frac{p\bigl(y_k|y_{1:k-1}, r_{0:k}^{(i)}\bigr)\, p\bigl(r_k^{(i)}|r_{k-1}^{(i)}\bigr)}{\pi\bigl(r_k^{(i)}|y_{1:k}, r_{0:k-1}^{(i)}\bigr)}. \tag{11}$$
By splitting the state space up in this way, the dimensionality considered in the particle filter itself is dramatically decreased and the number of particles needed to achieve a given accuracy is also significantly reduced.

CHANGE DETECTION
The success of any algorithm is dependent upon the reliability of the data which is provided as an input. Thus, detecting note events in the music for the particle filtering algorithms to track is as important as the actual algorithms themselves.
Onset detection falls into two categories. Firstly, there is detection of transient events which are associated with strong energy changes, epitomised by drum sounds. Secondly, there is detection of harmonic changes without large associated energy changes (e.g., in a string quartet). To implement the first of these, our method approximately follows many algorithms in the literature [7,11,12]: frequency bands, $f$, are separated and an energy evolution envelope $E_f(n)$ formed. A three-point linear regression is used to find the gradient of $E_f(n)$ and peaks in this gradient function are detected (equivalent to finding sharp, positive increases in energy which hopefully correspond to the start of notes). Low-energy onsets are ignored and, when there are closely spaced pairs of onsets, the lower amplitude one is also discarded. Three frequency bands were used: 20-200 Hz to capture low frequency information; 200 Hz-15 kHz, which captures most of the harmonic spectral region; and 15-22 kHz which, contrary to the opinion of Duxbury [29], is generally free from harmonic sounds and therefore clearly shows any transient information.
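The transient detector described above can be illustrated with a short sketch. It assumes that the energy envelope of one band is already available as a list of samples, and it uses the fact that the least-squares slope through three equally spaced points is half the difference between the outer two. The function names and the single `threshold` parameter are hypothetical; the removal of the weaker onset of each closely spaced pair is omitted for brevity.

```python
def regression_gradient(envelope):
    """Slope of a three-point least-squares line centred on each sample.

    For abscissae -1, 0, 1 the slope reduces to (E[n+1] - E[n-1]) / 2.
    The two end samples have no full window and are set to zero.
    """
    grad = [0.0] * len(envelope)
    for n in range(1, len(envelope) - 1):
        grad[n] = (envelope[n + 1] - envelope[n - 1]) / 2.0
    return grad


def pick_onsets(envelope, threshold):
    """Return indices of local peaks in the gradient above a threshold.

    The threshold stands in for the rejection of low-energy onsets.
    """
    grad = regression_gradient(envelope)
    onsets = []
    for n in range(1, len(grad) - 1):
        is_peak = grad[n] > grad[n - 1] and grad[n] >= grad[n + 1]
        if is_peak and grad[n] > threshold:
            onsets.append(n)
    return onsets
```

For example, an envelope that rises sharply around sample 3 or 4 and then stays flat yields a single detected onset at the start of the rise.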
Harmonic change detection is a harder problem and has received very little attention in the past, though two recent studies have addressed this [29,30]. To separate harmonics in the frequency domain, long short-time Fourier transform (STFT) windows (4096 samples) with a short hop (1/8 frame) were used. As a measure of spectral change from one frame to the next, a modified Kullback-Leibler distance measure was used:
$$D_{\mathrm{MKL}}(n) = \sum_{k \in K,\; d(k,n) > 0} d(k,n), \qquad d(k,n) = \log_2\!\left(\frac{|X[k,n]|}{|X[k,n-1]|}\right), \tag{12}$$
where $X[k,n]$ is the STFT with time index $n$ and frequency bin $k$. The modified measure is thus tailored to accentuate positive energy change. $K$ defines the region 40 Hz-5 kHz where the majority of harmonic energy is to be found. To pick peaks, a local average of the function $D_{\mathrm{MKL}}$ was formed and the maximum was then picked between each pair of crossings of the actual function and the average. A further discussion of the MKL measure can be found in [31] but a comprehensive analysis is beyond the scope of this paper. For beat tracking purposes, it is desirable to have a low false detection rate, though missed detections are not so important. While no actual rates for false alarms have been determined, the average detected inter-onset interval (IOI) was compared with an estimate given by $T/(N_b \times F)$, where $T$ is the length of the example in seconds, $N_b$ is the number of manually labelled beats and $F$ is the number of tatums in a beat. The detected average IOI was always of the order of, or larger than, the estimate, which shows that under-detection is occurring.
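One way to realise a spectral-change measure that accentuates positive energy change is sketched below. It assumes that the measure sums only the positive log-ratios of successive magnitude spectra over the bins in the region K; that form, the function name and the small constant `eps` used to avoid division by zero are all assumptions of this example.

```python
import math


def mkl_distance(prev_mag, curr_mag, bins, eps=1e-12):
    """Sum of positive log2 magnitude ratios between two STFT frames.

    prev_mag, curr_mag: magnitude spectra of frames n-1 and n.
    bins: indices of the frequency bins in the analysis region K.
    eps: hypothetical guard against zero magnitudes.
    """
    total = 0.0
    for k in bins:
        d = math.log2((curr_mag[k] + eps) / (prev_mag[k] + eps))
        if d > 0.0:  # only increases in energy contribute
            total += d
    return total
```

Bins whose magnitude falls or stays constant contribute nothing, so the measure responds to new harmonic energy entering the spectrum rather than to decaying notes.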
In summary, there are four vectors of onset observations, three from energy change detectors and one from a harmonic change detector. The different detectors may all observe an actual note, or any combination of them might not. In fact, clustering of the onset observations from each of the individual detection functions is performed prior to the start of the particle filtering. A group is formed if any events from different streams fall within 50 ms of each other for transient onsets and 80 ms for harmonic onsets (reflecting the lower time resolution inherent in the harmonic detection process). Inspection of the resulting grouped onsets shows that the inter-group separation is usually significantly more than the within-group time differences. A set of amplitudes is then associated with each onset cluster.
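The grouping of onsets from the four streams can be sketched as a single pass over the time-sorted events. This is a simplified illustration: it chains each event to the previous one when they are close enough, and it does not check that members of a group come from different streams. The stream labels and the function name are hypothetical.

```python
def cluster_onsets(events, transient_tol=0.050, harmonic_tol=0.080):
    """Group onset events that lie close together in time.

    events: iterable of (time_in_seconds, stream) pairs, where stream is
    a hypothetical label such as "low", "mid", "high" or "harmonic".
    The wider tolerance applies whenever a harmonic onset is involved.
    """
    clusters = []
    for time, stream in sorted(events):
        if clusters:
            last_time, last_stream = clusters[-1][-1]
            involves_harmonic = "harmonic" in (stream, last_stream)
            tol = harmonic_tol if involves_harmonic else transient_tol
            if time - last_time <= tol:
                clusters[-1].append((time, stream))
                continue
        clusters.append([(time, stream)])
    return clusters
```

Each resulting cluster then supplies one variable-length observation vector of onset times, together with the amplitudes of its transient members.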

BEAT MODEL 1
The model used in this section is loosely based on that of Cemgil et al. [16], designed for MIDI signals. Given the series of onset observations generated as above, the problem is to find a tempo profile which links them together and to assign each observation to a quantised score location.
The system can be represented as a JMLS where, conditional on the "jump" parameter, the system is linear Gaussian and the traditional Kalman filter can be used to evaluate the sequence likelihood. The system equations are then
$$x_k = \Phi_k(\gamma_k)\, x_{k-1} + \xi_k, \tag{13}$$
$$y_k = H_k x_k + \nu_k, \tag{14}$$
where $x_k$ is the tempo process at iteration $k$ and can be described as $x_k = [\rho_k, \Delta_k]^T$. $\rho_k$ is then the predicted time of the $k$th observation and $\Delta_k$ the tempo period, that is, $\Delta_k = 60/T_k$, where $T_k$ is the tempo in beats per minute (bpm). This is equivalent to a constant velocity process and the state innovation, $\xi_k$, is modelled as zero mean Gaussian with covariance $Q_k$.
To solve the quantisation problem, the score location is encoded as the jump parameter, $\gamma_k$, in $\Phi_k(\gamma_k)$. This is equivalent to deciding upon the notation that describes the rhythm of the observed notes. $\Phi_k(\gamma_k)$ is then given by
$$\Phi_k(\gamma_k) = \begin{bmatrix} 1 & \gamma_k \\ 0 & 1 \end{bmatrix}, \qquad \gamma_k = c_k - c_{k-1}.$$
The associated evolution covariance matrix is [32]
$$Q_k = q \begin{bmatrix} \gamma_k^3/3 & \gamma_k^2/2 \\ \gamma_k^2/2 & \gamma_k \end{bmatrix}$$
for a continuous constant velocity process which is observed at discrete time intervals, where $q$ is a scale parameter. While the state transition matrix is dependent upon $\gamma_k$, this is a difference term between two actual locations, $c_k$ and $c_{k-1}$. It is this process which is important and the prior on $c_k$ becomes a critical issue as it determines the performance characteristics. Cemgil breaks a single beat into subdivisions of two and uses a prior related to the number of significant digits in the binary expansion of the quantised location. Cemgil's application was in MIDI signals where there is 100% reliability in the data and the onset times are accurate. In audio signals, the event detection process introduces errors both in localisation accuracy and in generating entirely spurious events. Also, Cemgil's prior cannot cope with triplet figures or swing. Thus, we break the notated beat down into 24 quantised sub-beat locations, $c_k \in \{1/24, 2/24, \ldots, 24/24, 25/24, \ldots\}$, and assign a prior
$$p(c_k) \propto \exp\bigl(-\log_2 d(c_k)\bigr), \tag{17}$$
where $d(c_k)$ is the denominator of the fraction of $c_k$ when expressed in its most reduced form; that is, $d(3/24) = 8$, $d(36/24) = 2$, and so forth. This prior is motivated by the simple concern of making metrically stronger sub-beat locations more likely; it is a generic prior designed to work with all styles and situations.
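The score-location prior and the jump-dependent matrices can be illustrated as follows. The denominator function reproduces the two worked values given above. The exponential form of the prior and the white-noise-acceleration form of the covariance are assumptions of this sketch, as are all function names.

```python
from fractions import Fraction
import math

SUBDIVISIONS = 24  # quantised sub-beat locations per notated beat


def denominator(location_index):
    """d(c_k): denominator of location_index/24 in lowest terms."""
    return Fraction(location_index, SUBDIVISIONS).denominator


def location_prior(location_index):
    """Unnormalised prior, assumed here to be exp(-log2 d(c_k))."""
    return math.exp(-math.log2(denominator(location_index)))


def transition_matrices(gamma, q):
    """State transition and evolution covariance for a jump of gamma beats.

    Assumes a constant velocity model sampled after an interval gamma,
    with q as the scale parameter of the covariance.
    """
    phi = [[1.0, gamma], [0.0, 1.0]]
    cov = [[q * gamma ** 3 / 3.0, q * gamma ** 2 / 2.0],
           [q * gamma ** 2 / 2.0, q * gamma]]
    return phi, cov
```

Under these assumptions an onset on the beat (24/24) is favoured over a quaver off-beat (12/24), which is in turn favoured over a location such as 1/24.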
Finally, the observation model must be considered. Bearing in mind the pre-processing step of clustering onset observations from different observation functions, the input to the particle filter at each step, $y_k$, will be a variable length vector containing between one and four individual onset observation times. Thus, $H_k$ becomes a function of the length $j$ of the observation vector $y_k$ but is essentially $j$ rows of the form $[1\;\; 0]$. The observation error $\nu_k$ is also of length $j$ and is modelled as zero-mean Gaussian with diagonal covariance $R_k$, where the elements $r_{jj}$ are related to whichever observation vector is being considered at $y_k(j)$.
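Thus, conditional upon the $c_k$ process which defines the update rate, everything is modelled as linear Gaussian and the traditional Kalman filter [33] can be used. This is given by the recursion
$$\begin{aligned} x(k|k-1) &= \Phi_k\, x(k-1|k-1), \\ P(k|k-1) &= \Phi_k\, P(k-1|k-1)\, \Phi_k^T + Q_k, \\ K(k) &= P(k|k-1)\, H_k^T \bigl[H_k P(k|k-1) H_k^T + R_k\bigr]^{-1}, \\ x(k|k) &= x(k|k-1) + K(k)\bigl[y_k - H_k x(k|k-1)\bigr], \\ P(k|k) &= \bigl[I - K(k) H_k\bigr] P(k|k-1). \end{aligned} \tag{18}$$
Each particle must maintain its own covariance estimate $P(k|k)$ as well as its own state. The innovation or residual vector is defined to be the difference between the measured and predicted quantities,
$$\tilde{y}_k = y_k - H_k\, x(k|k-1), \tag{19}$$
and has covariance given by
$$S_k = H_k\, P(k|k-1)\, H_k^T + R_k. \tag{20}$$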
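One iteration of the per-particle Kalman filter can be sketched for the simplest case of a single onset time per step, so that the observation matrix is the row [1 0] and the innovation covariance is a scalar. The constant velocity transition and covariance forms, and all names, are assumptions of this example rather than a transcription of the implementation used here.

```python
import math


def kalman_step(x, P, gamma, q, y, r):
    """Predict over a jump of gamma beats, then update with onset time y.

    x: [predicted onset time, tempo period]; P: 2x2 covariance (lists).
    q: evolution scale parameter; r: observation noise variance.
    Returns the updated state, covariance, innovation and its variance.
    """
    # Prediction: x_pred = Phi x, P_pred = Phi P Phi^T + Q
    x_pred = [x[0] + gamma * x[1], x[1]]
    p00 = (P[0][0] + gamma * (P[0][1] + P[1][0])
           + gamma ** 2 * P[1][1] + q * gamma ** 3 / 3.0)
    p01 = P[0][1] + gamma * P[1][1] + q * gamma ** 2 / 2.0
    p10 = P[1][0] + gamma * P[1][1] + q * gamma ** 2 / 2.0
    p11 = P[1][1] + q * gamma
    # Innovation and its variance for H = [1 0]
    innovation = y - x_pred[0]
    s = p00 + r
    # Gain and update: P_new = (I - K H) P_pred
    k0, k1 = p00 / s, p10 / s
    x_new = [x_pred[0] + k0 * innovation, x_pred[1] + k1 * innovation]
    P_new = [[(1.0 - k0) * p00, (1.0 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, P_new, innovation, s


def log_likelihood(innovation, s):
    """Gaussian log-likelihood of a scalar innovation with variance s."""
    return -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
```

The value returned by `log_likelihood` is the quantity that would enter the particle weight for the hypothesised score location.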

Amplitude modelling
The algorithm as described so far will assign the beat (i.e., the phase of $c_{1:k}$) to the most frequent subdivision, which may not be the right one. To aid the correct determination of phase, attention is turned to the amplitude of the onsets.
The assumption is made that the onsets at some score locations (e.g., on the beat) will have higher energy than others. Each of the three transient onset streams maintains a separate amplitude process, while the harmonic onset stream does not have one associated with it because amplitude is not relevant for this feature.
The amplitude processes can be represented as separate JMLSs conditional upon $c_k$. The state equations are given by
$$\alpha_p^l = \Theta_p^l(c_{p-1}, c_p)\, \alpha_{p-1}^l + \epsilon_p, \tag{21}$$
$$a_p^l = \alpha_p^l + \sigma_p, \tag{22}$$
where $a_p^l$ is the amplitude of the $p$th onset from the observation stream $l$. Thus, an individual process is maintained for each observation function and updated only when a new observation from that stream is encountered. This requires the introduction of conditioning on $p$ rather than $k$; $1{:}p$ then represents all the indices within the full set $1{:}k$ where an observation from stream $l$ is found. $\Theta_p^l(c_{p-1}, c_p)$ is a function of $c_p$ and $c_{p-1}$. To build up the matrix $\Theta_p^l$, a selection of real data was examined and a 24 × 24 matrix constructed for the expected amplitude ratio between a pair of score locations. This is then indexed by the currently considered score location $c_p$ and also the previously identified one found in stream $l$, $c_{p-1}^l$, and the value given is returned to $\Theta_p^l$. For example, it could be that the expected amplitude for a beat is modelled as twice that of a quaver off-beat. If the particle history shows that the previous onset from a given stream was assigned to be on the beat and the currently considered location is a quaver, $\Theta_p^l$ would equal 0.5. This relative relationship allows the same model to cope with both quiet and loud sections in a piece. The evolution and observation error terms, $\epsilon_p$ and $\sigma_p$, are assumed to be zero mean Gaussian with appropriate variances.
From now on, to avoid complicating the notation, the amplitude process will be represented without sums or products over the three $l$ vectors using $a_p = \{a_p^1, a_p^2, a_p^3\}$ and $\alpha_p = \{\alpha_p^1, \alpha_p^2, \alpha_p^3\}$ (noting that some of these might well be given a null value at any given iteration). For each iteration $k$, between zero and all three of the amplitude processes will be updated.

Methodology
Given the above system, a particle filtering algorithm can be used to estimate the posterior at any given iteration. The posterior which we wish to estimate is given by $p(c_{1:k}, x_{1:k}, \alpha_{1:p}|y_{1:k}, a_{1:p})$, but Rao-Blackwellisation breaks down the posterior into separate terms
$$p(c_{1:k}, x_{1:k}, \alpha_{1:p}|y_{1:k}, a_{1:p}) = p(x_{1:k}|c_{1:k}, y_{1:k})\, p(\alpha_{1:p}|c_{1:k}, a_{1:p})\, p(c_{1:k}|y_{1:k}, a_{1:p}), \tag{23}$$
where $p(x_{1:k}|c_{1:k}, y_{1:k})$ and $p(\alpha_{1:p}|c_{1:k}, a_{1:p})$ can be deduced exactly by use of the traditional Kalman filter equations. Thus the only space to search over and perform recursion upon is that defined by $p(c_{1:k}|y_{1:k}, a_{1:p})$. This space is discrete but too large to enumerate all possible paths. Thus we turn to the approximation approach offered by particle filters.
By assuming that the distribution of $c_k$ is dependent only upon $c_{1:k-1}$, $y_{1:k}$ and $a_{1:p}$, the importance function can be factorised into terms such as $\pi(c_k|y_{1:k}, a_{1:p}, c_{1:k-1})$. This allows recursion of the Rao-Blackwellised posterior
$$p(c_{1:k}|y_{1:k}, a_{1:p}) \propto p(y_k|y_{1:k-1}, c_{1:k})\, p(a_p|a_{1:p-1}, c_{1:k})\, p(c_k|c_{k-1})\, p(c_{1:k-1}|y_{1:k-1}, a_{1:p-1}), \tag{24}$$
and recursive updates to the weight are given by
$$w_k^{(i)} \propto w_{k-1}^{(i)}\, \frac{p\bigl(y_k|y_{1:k-1}, c_{1:k}^{(i)}\bigr)\, p\bigl(a_p|a_{1:p-1}, c_{1:k}^{(i)}\bigr)\, p\bigl(c_k^{(i)}|c_{k-1}^{(i)}\bigr)}{\pi\bigl(c_k^{(i)}|y_{1:k}, a_{1:p}, c_{1:k-1}^{(i)}\bigr)}. \tag{25}$$
Algorithm 1:
For $k = 1$:
  for $i = 1 : N$, draw $x_1^{(i)}$, $\alpha_1^{(i)}$ and $c_1^{(i)}$ from respective priors.
For $k = 2 : \text{end}$:
  for $i = 1 : N$:
    Propagate particle $i$ to a set, $s = \{1, \ldots, S\}$, of new locations $c_k^{(s)}$.
    Evaluate the new weight $w_k^{(s,i)}$ for each of these by propagating through the respective Kalman filter. This generates $\pi(c_k|y_{1:k}, a_{1:p}, c_{1:k-1}^{(i)})$.
  for $i = 1 : N$:
    Pick a new state for each particle from $\pi(c_k|y_{1:k}, a_{1:p}, c_{1:k-1}^{(i)})$.
    Update weights according to (25).
The terms $p(y_k|y_{1:k-1}, c_{1:k})$ and $p(a_p|a_{1:p-1}, c_{1:k})$ are calculated from the innovation vector and covariance of the respective Kalman filters (see (19) and (20)). $p(c_k|c_{k-1})$ is simplified to $p(c_k)$ and is hence the prior on score location as given in Section 5.

Algorithm
The algorithm therefore proceeds as given in Algorithm 1. At each iteration, each particle is propagated to a set $S$ of new score locations and the probability of each is evaluated. Given the $N \times S$ set of potential states, there are then two ways of choosing a new set of updated particles: either stochastic or deterministic selection. The first proceeds in a similar manner to that described by Cemgil [16] where, for each particle, the new state is picked from the importance function with a given probability. Deterministic selection simply takes the best $N$ particles from the whole set of propagated particles. Fully stochastic resampling selection of the particles is not an optimal procedure in this case, as duplication of particles is wasteful. This leaves a choice between Cemgil's method of stochastically selecting one of the update proposals for each particle or the deterministic N-best approach. The latter has been adopted as intuitively sensible.
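The deterministic N-best selection described above can be sketched as follows. Each particle is represented as a (weight, path) pair, every particle is expanded to all proposed score locations, and the N heaviest candidates are kept. The `weight_fn` callback stands in for the Kalman filter likelihoods and the score-location prior; it and the function names are hypothetical.

```python
import heapq


def expand(particles, proposals, weight_fn):
    """Propagate every (weight, path) particle to each proposed location.

    weight_fn(path, location) returns the multiplicative weight update
    for extending the path to that location.
    """
    candidates = []
    for weight, path in particles:
        for location in proposals:
            new_weight = weight * weight_fn(path, location)
            candidates.append((new_weight, path + (location,)))
    return candidates


def select_n_best(candidates, n):
    """Keep the n highest-weight candidates and renormalise their weights."""
    best = heapq.nlargest(n, candidates, key=lambda c: c[0])
    total = sum(w for w, _ in best)
    if total == 0.0:
        raise ValueError("all selected candidates have zero weight")
    return [(w / total, path) for w, path in best]
```

Because every candidate is a distinct extension of a distinct path, the surviving set contains no duplicated paths, which is the property that motivates this choice over stochastic resampling.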
Particle filters suffer from degeneracy in that the posterior will eventually be represented by a single particle with high weight while many particles have negligible probability mass. Traditional PFs overcome this with resampling (see [23]), but both methods for particle selection in the previous section implicitly include resampling. However, degeneracy still exists, in that the PF will tend to converge to a single $c_k$ state, so a number of methods were explored for increasing the diversity of the particles. Firstly, jitter [24] was added to the tempo process to increase local diversity. Secondly, a Metropolis-Hastings (MH) step [34] was used to explore jumps to alternative phases of the signal (i.e., to jump from tracking off-beat quavers to being on the beat). Also, an MH step to propose related tempos (i.e., doubling or halving the tracked tempo) was investigated but found to be counterproductive.

BEAT MODEL 2
The model described above formulates beat location as the free variable and time as a dependent, non-continuous variable, which seems counter-intuitive. Noting that the model is bilinear, a reformulation of the tempo process is thus presented now where time is the independent variable and tempo is modelled as Brownian motion [35]. The state vector is now given by $z_k = [\tau_k, \dot{\tau}_k]^T$, where $\tau_k$ is in beats and $\dot{\tau}_k$ is in beats per second (obviously related to bpm). Brownian motion, which is a limiting form of the random walk, is related to the tempo process by
$$\dot{\tau}(t) = \dot{\tau}(0) + \sqrt{q}\, B(t), \tag{26}$$
where $q$ controls the variance of the Brownian motion process $B(t)$ (which is loosely the integral of a Gaussian noise process [32]) and hence the state evolution. This leads to
$$\tau(t) = \tau(0) + \int_0^t \dot{\tau}(s)\, ds. \tag{27}$$
Time $t$ is now a continuous variable and hence $\tau(t)$ is also a continuously varying parameter, though it is only "read" at algorithmic iterations $k$, thus giving $\tau_k = \tau(t_k)$.
The new state equations are given by
$$z_k = \Xi(\delta_k)\, z_{k-1} + \beta_k, \tag{28}$$
$$y_k = \Gamma_k t_k + \kappa_k, \tag{29}$$
where $t_k$ is therefore the absolute time of an observation and $\delta_k = t_k - t_{k-1}$ is the inter-observation time. $\Xi(\delta_k)$ is the state update matrix and is given by
$$\Xi(\delta_k) = \begin{bmatrix} 1 & \delta_k \\ 0 & 1 \end{bmatrix}. \tag{30}$$
$\Gamma_k$ acts in a similar manner to $H_k$ in model one and is of variable length but is a vector of ones of the same length as $y_k$. $\kappa_k$ is modelled as zero mean Gaussian with covariance $R_k$ as described above. $\beta_k$ is modelled as zero mean Gaussian noise with covariance given as before by Bar-Shalom [32],
$$Q_k = q \begin{bmatrix} \delta_k^3/3 & \delta_k^2/2 \\ \delta_k^2/2 & \delta_k \end{bmatrix}. \tag{31}$$
One of the problems associated with Brownian motion is that there is no simple, closed form solution for the prediction density, $p(t_k|\cdot)$. Thus attention is turned to an alternative method for drawing a hitting time sample of
$$\{t_k, \dot{\tau}_k \mid z_{k-1}, \tau_k\}. \tag{32}$$
This is an iterative process and, conditional upon initial conditions, a linear prediction for the time of the new beat is made. The system is then stochastically propagated for this length of time and a new tempo and beat position found. The beat position might under- or overshoot the intended location. If it undershoots, the above process is repeated. If it overshoots, then an interpolation estimate is made conditional upon both the previous and subsequent data estimates. The iteration terminates when the error on $\tau$ falls below a threshold. At this point, the algorithm returns the hitting time $t_k$ and the new tempo $\dot{\tau}_k$ at that hitting time. This is laid out explicitly in Algorithm 2, where $\Xi_i$ is given by
$$\Xi_i = \begin{bmatrix} 1 & dt_i \\ 0 & 1 \end{bmatrix} \tag{33}$$
and $Q_i$ by
$$Q_i = q \begin{bmatrix} dt_i^3/3 & dt_i^2/2 \\ dt_i^2/2 & dt_i \end{bmatrix}, \tag{34}$$
and $\mathcal{N}$ denotes the Gaussian distribution. The interpolation mean and covariance are given by (35) [36], where the index denotes the use of $\Xi$ and $Q$ from (33) and (34) with appropriate values of $dt$. Thus, we now have a method of drawing a time $t_k$ and new tempo $\dot{\tau}_k$ given a previous state $z_{k-1}$ and proposed new score (beat) location $\tau_k$. The algorithm then proceeds as before with a particle filter. The posterior can be updated as in (36), where $p(z_k|z_{1:k-1})$ can be factorised as in (37). Prior importance sampling [23] is used via the hitting time algorithm above to draw samples of $\dot{\tau}_k$ and $t_k$, as in (38). This leads to the weight update being given by (39). As before in Section 5, a single beat is split into 24 subdivisions and a prior set upon these as given above in (17). If $\kappa_k$ from (29) is modelled in the same way as $\nu_k$ from (14), then the likelihood is Gaussian with covariance again given by $R_k$, which is diagonal and of the same dimension, $j$, as the observation vector $y_k$. $\Gamma_k$ is then a $j \times 1$ matrix with all entries being 1.
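The hitting-time iteration described above can be illustrated with a simplified sketch. It assumes the constant velocity transition and the covariance form q[[dt³/3, dt²/2], [dt²/2, dt]] for a propagation over an interval dt. On overshoot it interpolates linearly between the previous and the overshot state, whereas the method described here draws from a conditional Gaussian, so this is not a transcription of Algorithm 2. All names and the tolerance are hypothetical.

```python
import math
import random


def propagate(z, dt, q, rng=random):
    """Stochastically advance (beat position, tempo) by dt seconds."""
    tau, tempo = z
    mean = (tau + dt * tempo, tempo)
    if q == 0.0 or dt == 0.0:
        return mean
    # Cholesky factor of q * [[dt^3/3, dt^2/2], [dt^2/2, dt]]
    l11 = math.sqrt(q * dt ** 3 / 3.0)
    l21 = (q * dt ** 2 / 2.0) / l11
    l22 = math.sqrt(q * dt - l21 ** 2)
    u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + l11 * u1, mean[1] + l21 * u1 + l22 * u2)


def draw_hitting_time(z, t, target, q, rng=random, tol=1e-3, max_iter=100):
    """Draw a time and tempo at which the beat position reaches target."""
    tau, tempo = z
    for _ in range(max_iter):
        if target - tau < tol or tempo <= 0.0:
            break
        dt = (target - tau) / tempo  # linear prediction of the hitting time
        new_tau, new_tempo = propagate((tau, tempo), dt, q, rng)
        if new_tau > target:
            # Overshoot: simplified linear interpolation back to the target
            frac = (target - tau) / (new_tau - tau)
            return t + frac * dt, tempo + frac * (new_tempo - tempo)
        # Undershoot (or exact hit): repeat from the new state
        tau, tempo, t = new_tau, new_tempo, t + dt
    return t, tempo
```

With q set to zero the sketch reduces to the deterministic prediction, which gives a convenient check on the bookkeeping of time and position.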
Also as before, to explore the beat quantisation space $\tau_{1:k}$ effectively, each particle is predicted onward to $S$ new positions for $\tau_k$ and therefore again, a set of $N \times S$ potential particles is generated. Deterministic selection in this setting is not appropriate, so resampling is used to stochastically select $N$ particles from the $N \times S$ set. This acts instead of the traditional resampling step in selecting high probability particles.
Amplitude modelling is also included in an identical form to that described in Section 5.1, which modifies (39) to give (40). Also, the MH step described in Section 5.3 to explore different phases of the beat is used again.

RESULTS
The algorithms described above in Sections 5 and 6 have been tested on a large database of musical examples drawn from a variety of genres and styles, including rock/pop, dance, classical, folk and jazz. Two hundred samples, averaging about one minute in length, were used and a "ground truth" was manually generated for each by recording a trained musician clapping in time to the music. The aim is to estimate the tempo and quantisation parameters over the whole dataset; in both models, the sequence of filtered estimates is not the best representation of this, due to locally unlikely data. Therefore, because each particle maintains its own state history, the maximum a posteriori particle at the final iteration was chosen. The parameter sets used within each algorithm were chosen heuristically; it was deemed impractical to optimise them over the whole database. Various numbers of particles $N$ were tried, though results are given below for $N = 200$ and 500 for models one and two, respectively. Above these values, performance continued to increase very slightly, as one would expect, but computational effort also increased proportionally.
Tracking was deemed to be accurate if the tempo was correct (interbeat interval matches to within 10%) and a beat was located within 15% of the annotated beat location. Klapuri [14] defines a measure of success as the longest consecutive region of beats tracked correctly as a proportion of the total (denoted "C-L" for consecutive-length). Also presented is a total percentage of correctly tracked beats (labelled "TOT"). The results are presented in Table 1. It was noted that the algorithms sometimes tracked at double or half tempo in psychologically plausible patterns; also, dance music with heavy off-beat accents often caused the algorithm to track 180° out of phase. The "allowed" columns of the table show results accepting these errors. Also shown for comparison are the results obtained using Scheirer's algorithm [7].
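The two success measures can be computed along the following lines. This sketch assumes that the 15% phase tolerance is measured relative to the local annotated interbeat interval and that each annotated beat is compared with its nearest estimated beat; both are interpretations made for the example, and the function name is hypothetical.

```python
def score_beats(reference, estimated, phase_tol=0.15, tempo_tol=0.10):
    """Return (C-L, TOT) as fractions of the scored annotated beats.

    reference, estimated: ascending beat times in seconds. The first
    annotated beat is skipped because it has no preceding interval.
    """
    flags = []
    for i in range(1, len(reference)):
        ref_ibi = reference[i] - reference[i - 1]
        correct = False
        if len(estimated) >= 2:
            # Nearest estimated beat that has a preceding interval
            j = min(range(1, len(estimated)),
                    key=lambda m: abs(estimated[m] - reference[i]))
            est_ibi = estimated[j] - estimated[j - 1]
            phase_ok = abs(estimated[j] - reference[i]) <= phase_tol * ref_ibi
            tempo_ok = abs(est_ibi - ref_ibi) <= tempo_tol * ref_ibi
            correct = phase_ok and tempo_ok
        flags.append(correct)
    if not flags:
        return 0.0, 0.0
    longest = run = 0
    for ok in flags:
        run = run + 1 if ok else 0
        longest = max(longest, run)
    return longest / len(flags), sum(flags) / len(flags)
```

A single mistracked region in the middle of an otherwise correct example lowers C-L much more than TOT, which is why the two figures are reported separately.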
The current state of the art is the algorithm of Klapuri [14] with 69% success for longest consecutive sequence and 78% for total correct percentage (accepting errors) on his test database consisting of over 400 examples. Thus the performance of our algorithm is at least comparable with this.
Figure 2 shows the results for model one over the whole database graphically while Figure 3 shows the same for model two. These are ordered by style and then performance within the style category. Figure 4 shows the tempo profile for a correctly tracked example using model one; note the close agreement between the hand labelled data and the tracked tempo.

DISCUSSION
The algorithms described above have some similar elements but their fundamental operation is quite different: the Rao-Blackwellised model of Section 5 actually bears a significant resemblance to an interacting multiple models system of the type used in radar tracking [33], as many of the stages are actually deterministic. The second model, however, is much more typically a particle filter with mainly stochastic processes. Both have many underlying similarities in the model though the inference processes are significantly different. Thus, the results highlight some interesting comparisons between these two philosophies. On close examination, model two was better at finding the most likely local path through the data, though this was not necessarily the correct one in the long term. A fundamental weakness of the models is the prior on $c_k$ (or equivalently, $\tau_k$ in model two), which intrinsically prefers higher tempos: doubling a given tempo places more onsets in metrically stronger positions, which is deemed more likely by the prior given in (17). Because the stochastic resampling step efficiently selects and boosts high probability regions of the posterior, model two would often pick high tempos to track (150-200 bpm), which accounts for the very low "raw" results. It should also be noted that the clapped signals used as ground truth were often slightly in error themselves.
A second problem also occurs in model two: because duplication of paths through the τ 1:k space is necessary to fully populate each quantisation hypothesis, fewer distinct paths are kept at each iteration.By comparison, the N-best selection scheme of model one ensures that each particle represents a unique c 1:k set and more paths through the state space are kept for a longer lag.This allows model one to recover better from a region of poor data.This also provides an explanation for why model one does not track at high tempo so often-because more paths though the state-space are retained for longer, more time is allowed for the amplitude process to influence the choice of tempo mode.Thus, the conclusion is drawn that the first model is more attractive: the Rao-Blackwellisation of the tempo process allows the search of the quantisation space to be much more effective.The remaining lack of performance can be accredited to four causes: the first is tracking at multiple tempo modessometimes tracking fails at one mode and settles a few beats later into a second mode.The results only reflect one of these modes.Secondly, stable tracking sometimes occurs at psychologically implausible modes (e.g., 1.5 times the correct tempo) which are not included in the results above.The third cause is poor onset detection.Finally, there are also examples in the database which exhibit extreme tempo variation which is never followed.
The result of this is a number of suggestions for improvements. Firstly, the onset detection is crucial: if the detected onsets are unreliable (especially at the start of an example), it is unlikely that the algorithm will ever be able to track the beat properly. This may suggest an "online" onset detection scheme where the particles propose onsets in the data, rather than the current offline, hard decision system. The other potential scheme for overcoming this would be to propose a salience measure (e.g., [21]) and directly incorporate this into the state evolution process, thus hoping to differentiate between likely and unlikely beat locations in the data; currently, the Rao-Blackwellised amplitude process has been given weak variances and hence has little effect in the algorithm, other than to propose correct phase. The other problems commonly encountered were tempo errors by plausible ratios; Metropolis-Hastings steps [27] to explore other modes of the tempo posterior were tried but met with little success.
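For readers unfamiliar with the idea, a generic Metropolis-Hastings move between tempo modes might look like the sketch below. It is an assumption-laden illustration and not the move that was tried in this work: the bimodal `log_target` (modes at 100 and 200 bpm), the function `mh_tempo_step`, and the choice of a symmetric proposal that doubles or halves the tempo are all hypothetical.

```python
import math
import random

def log_target(tempo_bpm):
    """Hypothetical bimodal log-posterior over tempo (not the paper's model)."""
    x = math.log2(tempo_bpm)
    a = math.log(0.7) - 0.5 * ((x - math.log2(100.0)) / 0.1) ** 2
    b = math.log(0.3) - 0.5 * ((x - math.log2(200.0)) / 0.1) ** 2
    m = max(a, b)  # log-sum-exp for numerical safety
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mh_tempo_step(tempo_bpm, rng):
    """One MH step with a symmetric proposal: double or halve the tempo."""
    factor = 2.0 if rng.random() < 0.5 else 0.5
    proposal = tempo_bpm * factor
    log_ratio = log_target(proposal) - log_target(tempo_bpm)
    accept_prob = math.exp(min(0.0, log_ratio))
    return proposal if rng.random() < accept_prob else tempo_bpm

rng = random.Random(1)
chain = [200.0]  # start in the faster mode
for _ in range(2000):
    chain.append(mh_tempo_step(chain[-1], rng))

fraction_at_100 = sum(1 for t in chain if t == 100.0) / len(chain)
print(fraction_at_100)
```

Because the proposal is symmetric, the acceptance probability reduces to the ratio of target densities. In a real tracker the target would be the full posterior over the particle's path, which is where such moves become considerably harder to make effective.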
Thus it seems likely that any real further improvement will have to come from music theory incorporated into the algorithm directly, and in a style-specific way: it is unlikely that a beat tracker designed for dance music will work well on choral music! Thus, data expectations and also anticipated tempo evolutions and onset locations would have to be worked into the priors in order to select the correct tempo. This will probably result in an algorithm with many ad hoc features but, given that musicians have spent the better part of 600 years trying to create music which confounds expectation, it is unlikely that a simple, generic model to describe all music will ever be found.

CONCLUSIONS
Two algorithms using particle filters for generic beat tracking across a variety of musical styles are presented. One is based upon the Kalman filter and is close to a multiple hypothesis tracker. This performs better than a more stochastic implementation which models tempo as a Brownian motion process. Results with the first model are comparable with the current state of the art [14]. However, the advantage of particle filtering as a framework is that the model and the implementation are separated, allowing the easy addition of extra measures to discriminate the correct beat. It is conjectured that further improvement is likely to require music-specific knowledge.

Figure 1: Diagram of relationships between metrical levels.

Figure 2: Results on test database for model one. The solid line represents raw performance and the dashed line is performance after acceptable tracking errors have been taken into account. (a) Maximum length correct (% of total). (b) Total percentage correct.

Figure 3: Results for model two. (a) Maximum length correct (% of total). (b) Total percentage correct.

Dave Matthews Band, "Best of What's Around" (live)

Figure 4: Tempo evolution for a correctly tracked example using model one.

Table 1: Results for beat tracking algorithms expressed as a total percentage averaged over the whole database.