Pitch Correlogram Clustering for Fast Speaker Identification

Gaussian mixture models (GMMs) are commonly used in text-independent speaker identification systems. However, for large speaker databases, their high computational run-time limits their use in online or real-time speaker identification situations. Two-stage identification systems, in which the database is partitioned into clusters based on some proximity criteria and only a single-cluster GMM is run in every test, have been suggested in the literature to speed up the identification process. However, most clustering algorithms used have shown limited success, apparently because the clustering and GMM feature spaces used are derived from similar speech characteristics. This paper presents a new clustering approach based on the concept of a pitch correlogram that captures frame-to-frame pitch variations of a speaker rather than short-time spectral characteristics like cepstral coefficients, spectral slopes, and so forth. The effectiveness of this two-stage identification process is demonstrated on the IVIE corpus of 110 speakers. The overall system achieves a run-time advantage of 500% as well as a 10% reduction of error in overall speaker identification.


INTRODUCTION
Speaker recognition aims at extracting and modeling characteristics of speech data that uniquely represent a person. These characteristics should ideally be robust to channel effects and noisy environments [1]. Cepstral coefficients in the Mel frequency domain [2] are, in this sense, the most robust among all feature vectors currently employed in speech recognition systems in a Gaussian mixture model (GMM) framework [3,4]. At present, these features are also commonly employed for speaker identification, even though the best feature vector for speaker identification, as contrasted with speech recognition, is still an open problem. Recent papers suggest transformed feature vectors for performance enhancement [5] in speaker identification systems. Campbell's paper is still a very good reference for the problems and issues involved in speaker identification [6].
Speaker recognition is done at two levels: verification [7,8,9] and identification [1,10]. Verification systems are closed set operations in which a speaker's claim to be one of the enrolled speakers is verified, generally in a cooperative text prompted mode, as in voice-based access control systems.
The system conducts a binary hypothesis test, relative to the claimed identity, on the speech feature data and returns a yes/no result. Speaker identification, on the other hand, involves finding the identity of the speaker from a given test utterance. This is normally done in a text-independent manner, for example, in surveillance operations where voice channels are monitored or in other noninvasive access control applications. Identification systems, in closed set operation, identify the most likely enrolled member as the source of the utterance. In open set identification, where the utterance may or may not belong to the enrolled class, the most likely fit is further verified to return a confirmation.
A major issue in the identification problem is the GMM run-time computation, which is linear in population size. To combat this problem, systems where the enrolled speaker database is partitioned into clusters based on some proximity criteria have been proposed in the literature [11]. In these two-stage identification systems, a test utterance is first mapped into a cluster and then matched to the nearest GMM in that cluster only. This reduces the run-time computation since only a few GMMs have to be run at a time. Also, if clustering is based on features uncorrelated with the GMM feature space and robust to channel and noise distortion, the overall system error performance improves. In addition, clustering helps in better positioning of GMM priors as well. GMM computation time reduces, on average, by a factor equal to the number of clusters. The computational advantage for complete identification will be somewhat less due to the cluster identification time for a test utterance. As we will see later, for large speaker populations, clustering is particularly advantageous when the average cluster size is small, the cluster size variation is low, and the cluster identification is reasonably fast. Popular approaches to clustering include vector quantization (VQ) codebook-based clustering (see [12] for a comparison of various VQ approaches) and covariance-based clustering (see Wang et al. [11]). Algorithms that cluster speakers in reduced-dimension spaces have also been proposed [13]. These algorithms cluster speakers based on projections of individual speaker data onto reduced-dimension subspaces corresponding to directions of maximum variability (eigenspaces corresponding to large eigenvalues of the dataset covariance) of the speaker database. This ensures fast clustering in addition to noise immunity [14].
Most clustering algorithms, however, use speaker features (MFCCs, GMM variances, etc.) based, directly or indirectly, on short-time spectrum analysis of the signal. These features, in addition to inadequate noise and channel distortion robustness, are also the ones used for later GMM identification. This can result in deterioration of performance at times. To see this, note that in the overall two-stage identification process, an error can occur because a given test utterance is either wrongly classified (mapped into the wrong cluster) in the first stage, or correctly classified but mapped onto a wrong GMM within the cluster in the second stage. Therefore, if P(e) denotes the probability of overall two-stage identification error, we have that P(e) = P(M) P(e/M) + (1 − P(M)) P(e/C), where P(M) is the probability of cluster misclassification (the first-stage error) and P(e/C) is the probability of second-stage error, that is, the error in mapping onto a wrong GMM from within the correct cluster. Clearly, P(e/M) = 1, since a misclassified utterance can never reach the correct speaker model, so that P(e) ≥ P(e/C). In case clustering uses the same features as the GMM, P(e/C) will be approximately the same as the probability of error in an unclustered single-stage GMM-based speaker identification system. Therefore, irrespective of the run-time advantage, the overall identification performance, P(e), of a two-stage SI system will be good (low P(e)) only if the clustering feature space is independent of the GMM feature space. In fact, if a two-stage identification system yields lower P(e) than a single-stage system, it is reasonable to assume that this is due to the relative independence of the clustering and GMM feature spaces.
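The error decomposition above can be checked with a small numeric sketch; the probability values below are illustrative and not taken from the paper's experiments:

```python
def two_stage_error(p_m, p_e_c):
    """P(e) = P(M)*P(e|M) + (1 - P(M))*P(e|C), with P(e|M) = 1:
    a cluster misclassification always yields a wrong speaker."""
    return p_m * 1.0 + (1.0 - p_m) * p_e_c

# Hypothetical values: 5% cluster misclassification, 20% within-cluster error.
p_e = two_stage_error(0.05, 0.20)
# 0.05 + 0.95 * 0.20 = 0.24, so P(e) >= P(e|C), as the text argues
```

The formula makes explicit why lowering P(M) alone cannot push P(e) below the within-cluster error P(e/C); only a clustering feature space independent of the GMM space can reduce both terms.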
In this paper, we introduce a clustering algorithm based on speaker pitch variations for a two-stage speaker identification system. Human pitch is normally found in the 50-400 Hz range (males dominate the lower end and females the higher end). For any speaker, however, there are small (2%-10%) variations in pitch from one voiced speech frame to the next, due to random variations in the vocal fold tension [15,16]. A frame is normally taken as 20-30 milliseconds (160 to 240 speech samples at an 8-kHz rate) in length for speech stationarity. While this variation is too broad to separate nearby speakers, it can group neighborhood speakers quite well. Also, current algorithms for pitch computation are very accurate and robust to noise and channel effects [16,17]. There have been earlier attempts to base speaker identification only on pitch properties, but these have not met with much success (beyond separation of the sexes) for the reasons mentioned above. Pitch contours, for example, were proposed as far back as 1972 for automatic speaker recognition [18] but did not become popular. Attempts to use the lognormal character of pitch for open set speaker identification resulted in high false alarm or miss rates [19]. Algorithms using pitch along with other independent features show better results [20,21,22]; for example, it has been shown that vocal-tract-features-based identification systems perform better if pitch histogram information is utilized as well [23].
In this paper, we introduce the pitch correlogram as a device to measure frame-to-frame pitch variations and use it to cluster speakers. Pitch is a property of the voiced part of speech and equals the period of the impulse-train excitation of the vocal tract for voiced sound production in a short-time stationary model [24]. Given a speech track, its silent frames are removed using a voice activity detection (VAD) algorithm, after which the voiced speech is separated from unvoiced speech using autocorrelation thresholding, since unvoiced speech has low correlation due to its random aperiodic nature [24]. We divide the whole expected pitch region into fixed equal bands (this paper uses 40), extract the pitch for each voiced frame, allocate its corresponding band, and monitor this pitch band allocation from one voiced frame to the next. This information is stored in the pitch correlogram, which is a 3-dimensional matrix whose (i, j, k)th entry denotes the joint probability of pitch bands i and j occurring k frames apart. Clearly, a pitch correlogram captures local variation within histogram bands, which makes it a better source of pitch information than a pitch histogram.
In the sequel, pitch is estimated using the mixed excitation linear prediction (MELP) speech codec pitch estimator, which is based on the superresolution pitch estimation algorithm [15]. This algorithm is the most accurate known at present and is only O(N) complex, where N is the number of samples per frame. Also, it is robust to noise and channel distortion and does not depend on direct short-time Fourier transform coefficients of speech. Consequently, pitch correlograms are robust to noise as well as reasonably independent of spectral features like the MFCCs or spectral slopes. This is borne out by the performance of the proposed overall two-stage identification system. Since GMM identification is dependent on speaker population and speech SNR, using correlograms is also likely to enhance the overall system performance due to low error in classification, in addition to the computational gains from clustering.
The clustering process is described in Figure 1 and discussed in Section 3. Given reference speech utterances from enrolled speakers, we extract the pitch correlogram for each utterance and use one of these as a reference correlogram and the others for training of the algorithm to fix the model size. The overall two-stage identification system is depicted in Figure 2. Given an unknown utterance, it is first mapped onto a cluster based on its correlogram and then passed through GMM models of speakers corresponding to the identified cluster. For closed set speaker identification, the utterance is mapped onto the most likely GMM, while for open set speaker identification, the nearest GMM is accepted only on further verification, for example, by thresholding the feature space a posteriori probability. This paper uses a 16-mixture GMM model with the feature space of 13th-order MFCC coefficients and spectral slope for experimental purposes, as in Murthy and Heck [10]. In this paper, "clustering" means offline partitioning of the speaker database using pitch correlograms. "Classification" and/or "misclassification" refer to mapping of a test or training speech track onto a cluster, and "overall identification" refers to the complete two-stage speaker identification process in which the test track is first mapped onto a cluster and then identified within the cluster by the normal GMM identification process.
The rest of the paper is organized as follows. In Section 2, we discuss the algorithms used for the estimation of pitch and pitch correlograms. Section 3 develops the clustering algorithm for an optimal number of clusters. Section 4 discusses the experimental results and comparisons with Murthy and Heck [10] results. We conclude in Section 5.

PITCH ANALYSIS
Perceptually, pitch is the attribute of auditory sensation in terms of which sounds are ordered on a musical scale. In speech processing, a simplified linear prediction model for voice production is used wherein pitch equals the period of the impulse train that excites the vocal tract (modeled as a linear system) in the voiced mode of articulation [24]. This excitation is produced by the vocal cords and its periodicity is reflected in the output voiced sound due to linear processing by the tract. Since speech is only a short-time stationary process (20-30 milliseconds), variation is observed in the pitch from frame to frame; this variation is the principal classification parameter in the sequel, as discussed above.
The pitch perception models use temporal information for identifying periodicities in the signal by estimating the period of the autocorrelation function of voiced speech, given by

c(l) = ∑_n s(n) s(n + l), (1)

where s(n) is the speech sample and c(l) its autocorrelation at a time lag l.
To obtain a useful set of results, the autocorrelation function is computed over a range of lag values. For periodic signals, the function attains a maximum at sample lags of 0, ±P, ±2P, and so forth, where P is the pitch period. This technique is most efficient at mid to low frequencies and is quite popular in speech recognition applications when the pitch range is limited. The autocorrelation function can further be used to differentiate between unvoiced and voiced frames. In unvoiced speech, the vocal cords are not vibrating, so the resulting speech waveform is aperiodic or random in nature [24]. Therefore the autocorrelation is low for an unvoiced frame, which can be differentiated from a voiced frame using an autocorrelation threshold.
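A minimal sketch of these two uses of the autocorrelation function (peak-picking for the pitch period and thresholding for the voiced/unvoiced decision), using a synthetic sinusoid as a stand-in for a voiced frame and seeded noise as an unvoiced one:

```python
import numpy as np

def autocorr(s, lag, win=160):
    # fixed-length analysis window; normalized so the value lies in [-1, 1]
    x, y = s[:win], s[lag:lag + win]
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

P = 80                                   # pitch period: 100 Hz at 8 kHz
n = np.arange(320)
voiced = np.sin(2 * np.pi * n / P)       # periodic, impulse-train-like excitation
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(320)      # aperiodic, fricative-like

lags = range(20, 160)                    # 50-400 Hz search range, as in the text
r_v = [autocorr(voiced, l) for l in lags]
T = 20 + int(np.argmax(r_v))             # integral pitch estimate: T == 80 here
r_uv = autocorr(unvoiced, T)             # low for noise
```

With a threshold of, say, 0.5 on the normalized correlation, the noise frame is rejected as unvoiced while the periodic frame passes; real speech sits between these two extremes, which is why the normalization of (2) below matters.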
We use the MELP coder pitch extraction algorithm [25], wherein pitch is estimated first in terms of integral sample lags and later compensated for any fractional lag. The sampled speech (normally at 8 kHz) is prefiltered by a 500 Hz LPF and framed into overlapping 22.5-millisecond blocks for retaining process stationarity during autocorrelation computation. The pitch period normally is less than 20 milliseconds (160 samples) so that it is captured easily in a frame.
The normalized autocorrelation function, with lag l, is given by

r(l) = c_l(0, l) / √(c_l(0, 0) c_l(l, l)),  where  c_l(m, k) = ∑_{n=0}^{N−1} s(n + m) s(n + k). (2)

Note that the length N of the analysis window is fixed while its starting point depends on the lag l. Again, since frame stationarity implies that c_l(0, 0) = c_l(l, l), r(l) and c_l(0, l) reach their maximum at the same lag. The normalized correlation is used, however, to threshold-differentiate between unvoiced and weakly voiced frames, since otherwise unstressed phonemes or certain emotional states may result in very low correlation and be taken for unvoiced sounds. The denominator in (2) nullifies the effect of low impulse-train strength in voiced frames on the autocorrelation function.
r(l) is computed over 20 to 159 sample lags (pitch in the 50 to 400 Hz range) and the lag at which it is maximum is taken as the integral pitch estimate T. The actual pitch is normally at an offset from this value. To find the direction of this offset, we compute r(T − 1) and r(T + 1). If r(T − 1) > r(T + 1), we decrement the integral pitch by 1; otherwise, we leave it unchanged. Let ∆ be the required offset for the pitch period, called the fractional pitch. The actual pitch period P is then T + ∆, but r(T + ∆) cannot be calculated directly from the sampled speech signal. An interpolation is therefore used to determine r(T + ∆). Since a low-passed version of the speech signal, with bandwidth much smaller than the sampling rate, is used for pitch estimation, a convex linear interpolation of the signal suffices. Therefore, we use

s(n + T + ∆) ≈ (1 − ∆) s(n + T) + ∆ s(n + T + 1). (3)

Since ∆ ∈ [0, 1), an optimal value of ∆ is computed by maximizing the autocorrelation between s(n) and s(n + T + ∆). As outlined in [15], the optimization can be carried out using the orthogonal projection theorem to give the value of ∆ as

∆ = [c(0, T + 1) c(T, T) − c(0, T) c(T, T + 1)] / [c(0, T + 1) (c(T, T) − c(T, T + 1)) + c(0, T) (c(T + 1, T + 1) − c(T, T + 1))]. (4)
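As a quick check of the linear-interpolation assumption in (3), the sketch below compares true fractional-delay samples of a low-frequency tone (well inside the 500 Hz low-pass band used for pitch estimation) with their linearly interpolated approximation; the 100 Hz tone and ∆ = 0.4 are illustrative choices, not values from the paper:

```python
import numpy as np

# For a signal band-limited far below the sampling rate, linear interpolation
# between adjacent samples closely reproduces the fractional-delay samples.
d = 0.4                                  # hypothetical fractional offset
f = 100.0 / 8000.0                       # 100 Hz tone at 8 kHz sampling
n = np.arange(200, dtype=float)
s = lambda t: np.sin(2 * np.pi * f * t)

true_shift = s(n + d)                    # exact samples at n + delta
interp = (1 - d) * s(n) + d * s(n + 1)   # the convex combination of (3)
err = float(np.max(np.abs(interp - true_shift)))
# err is tiny relative to the unit signal amplitude
```

The worst-case interpolation error for a sinusoid of normalized frequency f scales as (2πf)²/8, which is why the 500 Hz prefilter makes the convex interpolation adequate for 8 kHz speech.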

Correlogram
The phonetic unit of speech is the phoneme, whose intonation (pitch) varies depending on factors like stressed/unstressed vowels or syllables, accent, boundary and edge tones, neighborhood speech, and so forth. The pitch variation in a phoneme is, therefore, considerably influenced by the speaker's style and accent. A pitch correlogram expresses the correlation between pitch pairs at frame distances. It captures the pitch variation within a phoneme and at phoneme junctions. Assuming that a speaker employs a reasonably unique pitch variation in pronouncing particular phonemes, the pitch correlogram can be used to capture this variational characteristic (across all phonemes and their combinations if sufficient data are available) and to group speakers with adjacent behavior. This is the main speaker clustering idea of this paper. Let S = {o_1, o_2, o_3, . . .} denote the voiced speech utterance of a speaker, with o_i being its ith frame. From each frame o_i, the pitch is extracted as discussed earlier. Human pitch P, normally between 50-400 Hz, is then quantized into uniform nonoverlapping intervals called pitch bands. Let P_1, P_2, P_3, . . . , P_B be the B pitch bands and P : S → {P_1, P_2, . . . , P_B} be the utterance-to-pitch-band map. We define

S_{P_j} = {o_n ∈ S : P(o_n) = P_j, n ∈ Z+}, (5)

where Z+ is the set of positive integers; that is, S_{P_j} is the set of frames with pitch band P_j. The k(≥ 0)-delay joint distribution of pitch bands P_i and P_j (in the specified order) is defined as

λ^k_{P_i P_j}(S) = Pr(o_n ∈ S_{P_i}, o_{n+k} ∈ S_{P_j}). (6)

As mentioned above, a pitch correlogram is a 3-dimensional feature matrix of speech with (i, j, k)th entry given by (6). We will, however, use only the next-frame pitch changes (k = 1), so that our correlogram is a two-dimensional matrix C with C_{ij} = λ^1_{P_i P_j}(S). Once the voiced-frame pitch estimates are made and mapped onto pitch bands, the correlogram entries C_{ij} are given by the estimates

C_{ij} = (Number of times pitch band P_i is succeeded by P_j) / (Total number of voiced frames − 1). (7)

Note that ∑_{i,j=1}^{B} C_{ij} = 1. The number B of pitch bands used depends on performance requirements and memory and computation-time tradeoffs. In our experiments, we found that 35 to 45 pitch bands give reasonably good results. As already mentioned, this paper uses 40 bands.
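A minimal sketch of the k = 1 correlogram estimate of (7), with B = 40 bands over 50-400 Hz as in the text; the pitch track below is a synthetic random walk standing in for a real speaker's voiced frames:

```python
import numpy as np

def pitch_correlogram(pitches_hz, B=40, lo=50.0, hi=400.0):
    # quantize each frame's pitch into one of B uniform bands, then count
    # successive band pairs and normalize, per (7)
    bands = np.clip(((np.asarray(pitches_hz) - lo) / (hi - lo) * B).astype(int),
                    0, B - 1)
    C = np.zeros((B, B))
    for i, j in zip(bands[:-1], bands[1:]):
        C[i, j] += 1.0
    return C / (len(bands) - 1)          # entries sum to 1

# Hypothetical pitch track: small (here at most 5 Hz) frame-to-frame steps
rng = np.random.default_rng(0)
track = 200.0 + np.cumsum(rng.uniform(-5, 5, size=500))
C = pitch_correlogram(track)
# mass concentrates on and near the diagonal, since each step is smaller
# than the 8.75 Hz band width
```

Because the per-frame step (≤ 5 Hz) is below the band width, every transition lands within one band of the diagonal, which previews the diagonal dominance discussed next.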
Since pitch varies by 2%-10% in successive frames, the maximum pitch change is about 20 Hz in the mid pitch range of about 200 Hz. This implies that a pitch band i is most likely to transit to bands i, i ± 1, and i ± 2 in the next frame. Hence C is dominant on the diagonal band. Figure 3 shows the diagonal, postdiagonal, and prediagonal distribution of three different utterances of a randomly chosen speaker. The highly correlated nature of the three correlograms suggests speaker invariance of this feature. Also, as will be seen below, correlograms tend to cluster in a normed space, which makes them an ideal speaker ensemble partitioning feature. However, they are limited in scope as an indexing (speaker identification) feature because (a) pitch quantization leads to loss of information so that different speakers can occupy the same band and (b) pitch is emotion dependent [26] so that pitch bands need to be sufficiently broad to accommodate such dependence. Since it may not always be possible to reconcile these conflicting requirements, it is not advisable to use the correlogram as the sole indexing feature.

CLUSTERING
Clustering implies grouping objects together based on common characteristics, that is, partitioning a large database on some proximity criteria. It is a technique to understand, simplify, and interpret large amounts of multidimensional data. An ideal clustering algorithm is one that results in high intracluster correlation and low intercluster correlation [27], that is, it generates a sharp multimodal statistical distribution with identifiable peaks and valleys, much like a Gaussian mixture with separated means and comparable covariances. This may not always be possible to achieve and, in practice, algorithms that yield reasonably separated dense groups are used.
We propose a two-step clustering algorithm. In the first step, we take reference voiced speech tracks of enrolled members, generate the corresponding correlograms, and continuously merge nearest-neighbor correlograms. This hierarchy yields as many levels of clusters as the total speaker population, that is, level 1 contains M clusters (M equals the population ensemble size), each with population 1, and level M contains 1 cluster of population M. To fix the particular level in the hierarchy for the system, that is, the number and membership of the clusters to be used finally, the system is trained by another set of correlograms of the same speaker population. These are called the training correlograms. The training process involves evaluation of a Bayesian risk function (in our case, not strictly convex) for each level in the hierarchy relative to the set of training correlograms. The level that results in the smallest number of clusters with minimum risk is chosen as the final model level for the system. The algorithm is explained in detailed steps below.
For the correlogram space, we use the matrix norm ‖A‖_1 = ∑_{i,j} |a_{ij}|, so that the distance metric is given by d(X, Y) = ∑_{i,j=1}^{L} |x_{ij} − y_{ij}|, where X and Y are each L × L. The clustering algorithm is as follows.
(1) Generate correlograms for the speech-track reference ensemble. The correlograms are labeled C^1_1, C^1_2, C^1_3, . . . , C^1_M (superscript: level; subscript: cluster), where M is the number of reference enrolled speakers. These correlograms are generated on (at least) a 1-minute voiced-frame track of available speech for each speaker. This is called level-1 clustering, where each cluster has a population of 1 speaker.
(2) Compute pairwise correlogram distances at level N; merge the cluster pair with the least distance, say C^N_s and C^N_t with populations n_s and n_t, into a single cluster and calculate its new representative correlogram (for the (N + 1)th level) as

C^{N+1}_r = (n_s C^N_s + n_t C^N_t) / (n_s + n_t), (8)

that is, the mean of all the element correlograms of the merged cluster.
(3) Repeat step (2) until a single cluster of population M remains; this yields the complete M-level hierarchy.
(4) Generate training correlograms; that is, for all speakers, use some utterances for training the system (these could come from the main ensemble or from outside). Let T^N_i represent the set of training correlograms belonging to speakers from cluster i at level N. We use the same number of training correlograms for each speaker. This implies that at level 1, all T^1_i have an equal number of correlograms in their respective sets. If C^N_s and C^N_t are merged to create level N + 1, then the training correlogram set for the new cluster, say T^{N+1}_r, will be the union of the training correlograms of the underlying sets, that is, T^{N+1}_r = T^N_s ∪ T^N_t.
(5) Evaluate the following Bayesian risk function at each level (level N has M − N + 1 clusters):

ρ(N) = ∑_{i,j=1}^{M−N+1} R_{ij} P^N(i, j). (9)

The Bayesian risk is the expected value of a random cost variable, given in our case by R_{ij} as below:

R_{ij} = 1 for i ≠ j,  R_{ii} = −1/N_i, (10)

where N_i is the population of cluster i. Since P_i, the probability of cluster i, equals the relative cluster size N_i/M for equiprobable speakers, the reward R_{ii} may equivalently be written as −1/(M P_i). The cost of all wrong decisions is equal, while correct decisions are rewarded in inverse proportion to cluster size. This is done in order to reduce the computational complexity (Section 3.1) of the overall speaker identification system. Let P^N(i, j) be the probability that an utterance from cluster i is mapped onto cluster j, that is, the training correlogram from cluster i turns out to be closest to the representative correlogram for cluster j. P^N(i, j) can be estimated as

P^N(i, j) = (Number of training utterances from T^N_i mapped into cluster j) / (Total number of training utterances).
(6) The level at which the risk is the least represents an optimal number of clusters and, in case there are multiple global minima, we use the one with the smallest cluster size.
Experimental results (Section 4) show that the algorithm satisfies the properties of a good practical clustering algorithm. Note also that once the algorithm has been executed, each cluster is represented by a single correlogram, which is the mean of all its element correlograms.
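The merge loop of steps (1)-(2) can be sketched as follows; the toy 2 × 2 "correlograms" and the omission of the risk-based level selection of steps (4)-(6) are simplifications for illustration:

```python
import numpy as np

def agglomerate(correlograms):
    # each cluster: (mean correlogram, member indices); record membership
    # at every level of the hierarchy
    clusters = [(c.astype(float), [i]) for i, c in enumerate(correlograms)]
    levels = [[m for _, m in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.abs(clusters[a][0] - clusters[b][0]).sum()  # L1 distance
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        (ca, ma), (cb, mb) = clusters[a], clusters[b]
        # representative of the merged cluster: mean of element correlograms
        merged = ((len(ma) * ca + len(mb) * cb) / (len(ma) + len(mb)), ma + mb)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        levels.append([m for _, m in clusters])
    return levels

# Two tight synthetic groups: early merges stay in-group
group1 = [np.eye(2) + 0.01 * k for k in range(3)]
group2 = [np.ones((2, 2)) + 0.01 * k for k in range(3)]
levels = agglomerate(group1 + group2)
# at the 2-cluster level, membership is {0, 1, 2} and {3, 4, 5}
```

The weighted mean keeps each representative equal to the mean of all its element correlograms, matching the note above; risk evaluation would then pick one level of `levels` as the operating point.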

Computational complexity
As mentioned earlier, in the two-stage speaker identification process (see also Figure 2), an unknown utterance is first mapped onto a cluster and then the speaker from the cluster is identified using standard GMM identification. The overall complexity of the two-stage identification process can be computed in terms of the complexity of its subprocesses and compared with a single-stage GMM-only identification process. First, we introduce some notation.
(1) α denotes the computational complexity of calculating a pitch correlogram for a test utterance.
(2) β denotes the complexity of one correlogram distance computation (comparison with one cluster-representative correlogram).
(3) γ denotes the complexity of running one speaker GMM on the utterance.
(4) N_0 denotes the number of clusters, N_i the population of cluster i, and M the total number of enrolled speakers.
Clearly, the complexity of a single-stage GMM-based identification process equals Mγ. In the two-stage scenario, the computational complexity of the first stage, that of finding the cluster to which an unknown speaker belongs, equals α + N_0 β. In the second stage, for a given identified cluster with population N_i, the complexity equals N_i γ. Therefore, the expected value of computation in the overall two-stage process is given by

E[C] = α + N_0 β + ∑_{i=1}^{N_0} P_i N_i γ, (11)

where P_i is the probability of cluster i and equals N_i/M when all the speakers are equiprobable, in which case (11) simplifies to

E[C] = α + N_0 β + (γ/M) ∑_{i=1}^{N_0} N_i². (12)

The complexity is minimum when all clusters are equal, that is, N_i = M/N_0 for all i. In this case it is Mγ/N_0 + N_0 β + α, which, under the reasonable assumption that β ≪ γ, is much smaller than Mγ. This computational advantage makes a two-stage speaker identification system preferable to a single-stage one.
Observe that fewer clusters implies a large average cluster size, which, in turn, leads to a small misclassification (mapping of a test/unknown utterance to a wrong cluster) probability and a large GMM computational load. The converse is equally true. Computationally, the worst and the best cases would occur when there are one and M clusters, respectively, but classification errors follow exactly the reverse path. The optimum number is actually a trade-off between misclassification error and total computational load. At the optimal point, the decrease in misclassification error (from further merging) does not compensate for the increase in computation caused by the larger average cluster size. Note that, at any level in the clustering hierarchy, when the number of clusters is N_0, the mean cluster size equals M/N_0 and the cluster size sample variance equals

σ² = (1/N_0) ∑_{i=1}^{N_0} (N_i − M/N_0)². (13)

Equation (12) can be represented in terms of the sample variance as

E[C] = α + N_0 β + (γ/M) (N_0 σ² + M²/N_0). (14)

Clearly, for a given number of clusters, the computational complexity is a function of the cluster size sample variance. If this variance is large (a case encountered fairly often), so is the complexity. Therefore, clustering algorithms that yield small cluster-size sample variance achieve a better complexity advantage. In general, the cluster size sample variance tends to be large, because some clusters are likely to deviate much from the mean in the algorithm. Therefore, the computational complexity of the system can be further reduced only if the clustering algorithm limits the deviation of the cluster sizes.
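The two equivalent forms of the expected computation, the direct sum over squared cluster sizes and the sample-variance form, can be checked numerically; the cluster sizes and the values of α, β, γ below are made up for illustration:

```python
def expected_cost(sizes, alpha, beta, gamma):
    # E[C] = alpha + N0*beta + (gamma/M) * sum_i Ni^2
    M, N0 = sum(sizes), len(sizes)
    return alpha + N0 * beta + (gamma / M) * sum(n * n for n in sizes)

def expected_cost_var_form(sizes, alpha, beta, gamma):
    # E[C] = alpha + N0*beta + (gamma/M) * (N0*var + M^2/N0)
    M, N0 = sum(sizes), len(sizes)
    mean = M / N0
    var = sum((n - mean) ** 2 for n in sizes) / N0
    return alpha + N0 * beta + (gamma / M) * (N0 * var + M * M / N0)

sizes = [4, 7, 12, 13]                   # hypothetical cluster sizes, M = 36
c1 = expected_cost(sizes, 1.0, 0.1, 5.0)
c2 = expected_cost_var_form(sizes, 1.0, 0.1, 5.0)
# c1 == c2, and the balanced partition [9, 9, 9, 9] costs less than either
```

The identity ∑ N_i² = N_0 σ² + M²/N_0 makes the variance penalty explicit: for fixed N_0, every unit of cluster-size variance adds N_0 γ/M to the expected cost.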

EXPERIMENTAL RESULTS
The database used in the study is the IVIE corpus, publicly available at www.phon.ox.ac.uk/∼esther/ivyweb/download1.html. It has 110 speakers (55 male and 55 female) with 12 utterances from every speaker, each of 15- to 60-second duration.
The performance curves for the clustering algorithm are given below. Figure 4 gives the probability of misclassification versus the number of voiced frames used in training for 36 speakers. The probability of misclassification for N clusters equals ∑_{i,j=1, i≠j}^{N} P^N(i, j). It shows asymptotic behavior, reaching a tractable minimum at around 1000 frames. This suggests that a sufficient statistic for clustering is captured in about 1000 voiced speech frames. Figure 5 shows, for 110 speakers, the probability of misclassification versus the number of clusters for 20 × 20 and 40 × 40 pitch correlograms. As the number of clusters becomes larger (for example, when the database size increases), the performance difference between the two diverges. This is because smaller pitch bands capture more local information that can differentiate nearby clusters.
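The misclassification probability just defined is the off-diagonal mass of the estimated joint matrix P^N(i, j); a hypothetical 3-cluster example:

```python
import numpy as np

# Hypothetical joint matrix P^N(i, j): entry (i, j) is the fraction of
# training utterances from cluster i mapped onto cluster j; entries sum to 1.
P = np.array([[0.30, 0.02, 0.01],
              [0.03, 0.28, 0.02],
              [0.01, 0.01, 0.32]])

p_mis = P.sum() - np.trace(P)   # off-diagonal mass = misclassification prob.
# p_mis = 0.10 for this example
```

The diagonal entries are the per-cluster hit rates, so tracking `np.trace(P)` across hierarchy levels is an equivalent way to read the curves of Figures 4 and 5.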
The primary aim of introducing clusters in the speaker identification system is to decrease the computational complexity. However, since the overall SI system performance needs to be maintained (the probability of overall speaker identification error kept below a threshold), the probability of misclassification should be small. One approach to this problem is to map an unknown speech utterance onto its two nearest neighbors (see Figure 6). The improved performance, however, comes at the expense of running two-cluster GMMs for SI. Figure 7 shows the computational complexity of the two-stage identification system with respect to a single-stage one (dashed lines) when the number of speakers is 110, 62, and 36, respectively. The optimum number of clusters is 18, 15, and 9 for 110, 62, and 36 speakers, respectively, and is clearly not a linear function of the ensemble size. This is because some new speakers are absorbed into existing clusters.
Clearly, the required computation varies inversely with the number of clusters. At the optimum point, the computational complexity of the proposed two-stage system is only 0.2 times that of a single-stage system that uses the same GMM feature space (see Figure 7).
For a given number of clusters, the computational complexity of the two-stage system (see (12)) is minimum when all the clusters have the same population. However, since cluster formation is data dependent, the cluster sizes are often different. As suggested earlier, the intracluster distance should be small. To test the algorithm's efficiency from this point of view, the largest cluster obtained was taken as a new seed population and the clustering algorithm was reapplied to this subensemble of 18 speakers. The process returned 4 clusters as optimal (Figure 10), showing the robustness of the algorithm. Therefore, speakers close in the correlogram space get clustered together (the intracluster distance is small) in our process, which is a major desired behavior of any clustering process, as discussed in Section 3. Table 1 shows the misclassification error obtained at different SNRs at the optimal cluster level. Again, it is well known that identifying females is tougher than identifying males. Table 2 shows that our algorithm is not biased in favor of either sex in terms of performance.
To test the relative performance of the one- and two-stage SI systems, individual speaker GMMs with 16 mixtures and a feature space of 13th-order MFCC coefficients and spectral slope were trained and then run in a closed set framework. The results are listed in Table 3. The two-stage system returned an overall speaker identification error 10% lower than the single-stage one. It also used only about 23% of the computation time. This performance improvement suggests the relative independence of the pitch correlogram and GMM feature spaces. The proposed two-stage SI system is thus efficient in both the performance and computation-time dimensions.
The apparently high error rates (34% in the single-stage and 24% in the two-stage system) are due to the small feature space dimensionality (13 MFCCs and spectral slope) used. Normally, a minimum of 39 features (13 MFCC, 13 delta, and 13 acceleration coefficients) is used. Even then, errors tend to be high; for example, Reynolds reports around 26 percent error for telephone-quality speech with this dimensionality [2]. The feature space used in this paper is the same as the one used by Murthy and Heck [10], who obtained an error of 42% on telephone-quality speech with 64-mixture GMMs on 100 speakers (we use the clean speech of the IVIE corpus and only 16-mixture GMMs for our experiments). Our aim is to reduce computation time without compromising performance (and noise robustness), and this will certainly hold better for larger feature space dimensionality.

CONCLUSION
This paper suggests using a pitch-correlogram-based front-end database classifier to speed up text-independent SI systems based on Gaussian mixture models. We have shown that, in addition to a large computational advantage, better robustness to noise and distortion and better overall performance are ensured by clustering. This is due to the robustness of the clustering statistic as well as its relative independence from the GMM feature space. The run-time advantage is particularly significant for surveillance and access control SI systems where live, or near-live, identification is desirable.
Pitch correlograms are computationally inexpensive and can easily be implemented on a real-time platform. Though the pitch correlogram has been used with GMMs in this paper, the framework allows it to be directly integrated with any SI system for computational gain and increased robustness. The effectiveness of the pitch-correlogram-based classifier has been demonstrated in a closed set SI system. It can also be used to similar effect in an open set SI solution.
The clustering algorithm yields the optimal number of clusters in Bayesian framework. Dynamic enrolment and distortion channel issues for clustering are under study and will be reported elsewhere.