Sub-band based group delay segmentation of spontaneous speech into syllable-like units

Abstract—In the development of a syllable-centric Automatic Speech Recognition (ASR) system, segmentation of the acoustic signal into syllabic units is an important stage. Although the short-term energy (STE) function contains useful information about syllable segment boundaries, it has to be processed before segment boundaries can be extracted. This paper presents a sub-band based group delay approach to segment spontaneous speech into syllable-like units. This technique exploits the additive property of the Fourier transform phase and the deconvolution property of the cepstrum to smooth the STE function of the speech signal and make it suitable for syllable boundary detection. By treating the STE function as a magnitude spectrum of an arbitrary signal, a minimum phase group delay function is derived. This group delay function is found to be a better representative of the STE function for syllable boundary detection. Although the group delay function derived from the STE function of the speech signal contains segment boundaries, the boundaries are difficult to determine in the context of long silences, semivowels, and fricatives. In this paper, these issues are specifically addressed and algorithms are developed to improve the segmentation performance. The speech signal is first passed through a bank of three filters, corresponding to three different spectral bands. The STE functions of these signals are computed. Using these three STE functions, three minimum phase group delay functions are derived. By combining the evidence derived from these group delay functions, the syllable boundaries are detected. Further, a multi-resolution based technique is presented to overcome the problem of shift in segment boundaries during smoothing. Experiments carried out on the Switchboard and OGI-MLTS corpora show that the error in segmentation is at most 25 ms for 67% and 76.6% of the syllable segments, respectively.


I. INTRODUCTION
One of the major reasons for considering the syllable as a basic unit for ASR systems is its better representational and durational stability relative to the phoneme [1]. The syllable was proposed as a unit for ASR as early as 1975 [2], where irregularities in the phonetic manifestations of phonemes were discussed. It was argued that the syllable would serve as an effective minimal unit in the time domain. In [3], it is demonstrated that segmentation into syllable-like units followed by isolated-style recognition of continuous speech performs well.
Researchers have tried different ways of segmenting the speech signal either at the phoneme level or at the syllable level, with or without the use of phonetic transcription. These segmentation methods can further be classified into two categories, namely, time-domain based methods, where the short-term energy function, zero-crossing rate, etc. are used, and frequency-domain based methods, where short-term spectral features are used.
In [4], a loudness function, defined as the time-smoothed and frequency-weighted summation of the signal spectrum, is used for segmenting speech into syllabic units. Syllable boundaries are placed at local minima in the loudness function, subject to various conditions.
A syllabification procedure developed in [5] for German makes an initial estimate of syllable boundaries based on voicing, energy level, and place of articulation, and then locates syllables based on a more detailed acoustic analysis.
In [6], a syllable level segmentation technique is proposed for Japanese, based on a common syllable model. The segment boundaries are detected by finding the optimal HMM state sequence.
In [7], a multi-layered neural network structure for continuous speech recognition, based on isolation and identification of syllables, is presented. The syllable boundaries are detected at the first layer of the network, which is an adaptation of Kohonen's phonotopic feature map trained by unsupervised learning.
A short-term energy based method for detecting syllable nuclei is presented in [8]. In this work, the speech is first band-pass filtered and then the short-term magnitude function is computed. To suppress the ripples caused by fricatives or transient phonemes, the short-term magnitude function is further low-pass filtered at approximately 10 Hz. The peaks of the resulting energy contour are declared as the syllable nuclei.
In [9], a temporal flow model (TFM) network has been developed to extract syllable boundary information from continuous speech, where TFM captures the time varying properties of the speech signal.
The syllable is structurally divisible into three parts: the onset, nucleus, and coda [10]. Although many syllables contain all three elements, a significant number contain either one or two. With rare exceptions, when a single component is present, it is the nucleus. Generally, the nucleus is vocalic, while the onset and coda are usually consonantal in form. In terms of the short-term energy function, the syllable can be viewed as an energy peak in the nucleus region that tapers off at both ends of the nucleus, where a consonant may be present, which results in local energy fluctuations. If these local energy fluctuations are smoothed out, then the valleys at both ends of the syllable nucleus can be considered as syllable boundaries.
Many languages of the world possess a relatively simple syllable structure consisting of several canonical forms [10]. Most of the syllables in such languages contain just two phonetic segments, typically of the CV type (for example, Japanese). The remaining syllabic forms are generally of the V or VC variety. In contrast, English and German possess a more heterogeneous syllable structure, in which the onset and/or coda constituents often contain two or more consonants. But a salient property shared by stress-timed and syllable-timed languages is the preference for CV syllabic forms in spontaneous speech. Nearly half of the forms in English and over 70% of the syllables in Japanese are of this variety. There is also a substantial proportion of CVC syllables in the spontaneous speech of both languages [10]. Analysis of the Switchboard corpus shows that nearly 88% of the syllables are of simple structure and only 12% have a more complex structure with consonant clusters [10]. This shows that even for languages which are not syllable-timed, the syllable can be defined using a simple structure. Further, the definition of the syllable in terms of the short-term energy function is suitable for almost all languages, in the case of spontaneous speech. Keeping this fact in mind, in this paper, a time-domain based speech segmentation procedure is described, which segments the speech signal into syllable-like units without knowledge of the phonetic transcription. This approach is somewhat similar to homomorphic filtering, which essentially smoothes the magnitude spectrum of the windowed speech signal.
Earlier, a method was proposed in [11] for segmenting the acoustic signal into syllable-like units, in which a minimum phase signal is derived from the short-term energy function as if it were a magnitude spectrum. It is observed that the group delay function of this minimum phase signal is a better representative of the short-term energy function for performing segmentation. Later, several refinements were made to improve the performance of the baseline segmentation algorithm [12]. In this paper, we specifically discuss the refinements made to the system described in [11].

II. SHORT-TERM ENERGY BASED SEGMENTATION
A simple candidate for segmenting speech is the short-term energy (STE) function of the speech signal. The high energy regions in the STE function correspond to syllable nuclei, and the valleys at both ends of the syllable nuclei are approximately the syllable boundaries. But the raw STE function cannot be used directly to perform segmentation, due to significant local energy fluctuations caused by the presence of transient consonants and fricatives (see Figure 1(b)). Techniques like fixed thresholding can be used, but they suffer when the energy variation across the signal is quite high. For continuous speech, especially spontaneous speech, the energy is quite high at the beginning of a phrase and tapers off towards the end of the phrase. Adaptive thresholding can be used to address this problem, but the threshold value has to be learnt continuously from the speech signal. Further, the region over which the adaptive threshold is computed becomes crucial: too large a region will miss boundaries, while too short a region will generate spurious boundaries.
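As a concrete reference, the STE computation can be sketched as follows. The frame parameters (20 ms rectangular windows with 10 ms overlap) are taken from the experimental setup described later in the paper; the function name is our own:

```python
import numpy as np

def short_term_energy(x, fs, win_ms=20, hop_ms=10):
    """Short-term energy with overlapping rectangular windows.

    win_ms/hop_ms follow the paper's experimental setup
    (20 ms window, 10 ms hop); everything else is a sketch.
    """
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, len(x) - win) // hop
    return np.array([np.sum(x[i * hop:i * hop + win] ** 2)
                     for i in range(n_frames)])
```

High-energy frames of this function correspond to syllable nuclei; the valleys between them are the boundary candidates discussed above.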
To overcome the problems due to local energy fluctuations, the STE function should be smoothed. Smoothing the STE function can be performed in several ways. First, the STE function can be computed with an increased window size, but with the consequence of a shift in boundary locations. The STE function is normally mean-smoothed with a narrow window (see Figure 1(d)). In this case, the order of mean-smoothing is crucial: if the order is large, it will result in a significant shift in boundaries or even missed boundaries altogether, while if the order is small, it will not smooth the function properly. In [13], it is mentioned that the syllable duration can be conceptualized in terms of modulation frequency. For example, a syllable duration of 200 ms is equivalent to a modulation frequency of 5 Hz. Further, the syllable duration analysis [10] performed on the Switchboard corpus [14] shows that the duration of syllables mostly varies from 100 ms to 300 ms, with a mean of 200 ms. In terms of modulation frequency, this varies from 3 Hz to 10 Hz, with a mean of 5 Hz. Using this approach, in [8], a low-pass filter with a cut-off frequency of 10 Hz is applied to the logarithmic STE amplitude to suppress the ripples caused by fricatives or transient consonants. This forces the system to oscillate at syllable frequencies (see Figure 1(c)). The selection of the cut-off frequency is crucial; it should be different for different speech rates.
In this paper, an attempt is made to overcome these issues. The STE function is a non-zero, positive function, whereas the magnitude spectrum of any real signal has the symmetry property |X(e^jω)| = |X(e^-jω)|. If the STE function is symmetrized, it will have properties similar to those of a magnitude spectrum. Therefore, techniques applied for processing the magnitude spectrum can be applied to the energy function. The IDFT of this assumed magnitude spectrum will be a two-sided signal (the real cepstrum). If the causal portion of this signal alone is considered, it is a perfect minimum phase signal, since it is derived from the magnitude spectrum alone. Now, smoothing of this assumed magnitude spectrum can be performed using one of the following techniques.
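The symmetrize-and-retain-the-causal-portion step can be sketched as below. The gamma-compression (gamma = 0.001) is taken from the experimental setup described later; the function name and all other details are our own sketch, not the paper's implementation:

```python
import numpy as np

def minimum_phase_sequence(ste, gamma=0.001):
    """Treat the gamma-compressed, symmetrized STE function as the
    magnitude spectrum of an arbitrary signal, and return the causal
    portion of its IDFT (nominally a minimum phase sequence).
    """
    e = np.asarray(ste, dtype=float) ** gamma
    # Symmetrize so the assumed spectrum has the even symmetry of the
    # magnitude spectrum of a real signal: [e0..eN-1, eN-1..e1].
    sym = np.concatenate([e, e[-1:0:-1]])
    c = np.fft.ifft(sym).real          # two-sided "root cepstrum"
    causal = np.zeros_like(c)
    n_half = len(sym) // 2
    causal[:n_half + 1] = c[:n_half + 1]   # keep the causal half only
    return causal
```

A cepstral lifter applied to this causal sequence (discussed below as the choice of lifter size) then performs the smoothing.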

Cepstrum based smoothing:
It is well established that high-frequency ripples can be removed by applying a lifter in the cepstral domain, thereby retaining the low-frequency ripples alone [15]. Using the same analogy, in our work, the symmetrized STE function is treated as if it were the magnitude spectrum of an arbitrary signal. The low-frequency oscillations in the STE function correspond to the syllable rate, and the high-frequency oscillations or ripples correspond to the presence of transient consonants and fricatives. The high-frequency ripples in the STE function can therefore be removed as is done in homomorphic filtering.
Cepstrum-LP based smoothing: The cepstrum can be modeled using linear prediction, by choosing a proper predictor order based on the number of syllables present in the speech signal.
Root cepstrum based smoothing: In [16], it is shown that the performance of spectral root homomorphic deconvolution is similar to, or even better than, that of log homomorphic deconvolution. Here, the root cepstrum is defined as the IDFT of |X(e^jω)|^γ, where γ is the spectral root.
It has been well established in the literature that minimum phase group delay functions are very useful in formant extraction [17]. In the present work also, instead of deriving a new magnitude spectrum from the cepstrum, group delay functions are derived, as explained in the following section.

III. GROUP DELAY BASED SEGMENTATION OF SPEECH
The group delay function is defined as the negative derivative of the Fourier transform phase:

    τ(ω) = -dθ(ω)/dω                                    (1)

The group delay function exhibits an additive property: if

    X(e^jω) = X1(e^jω) X2(e^jω)                         (2)

then the group delay function τ(ω) can be written as

    τ(ω) = τ1(ω) + τ2(ω)                                (3)

From Equations 2 and 3, observe that a multiplication in the spectral domain becomes an addition in the group delay domain. To demonstrate the power of the additive property of the group delay spectrum, three different systems are chosen (Figures 2(a), 2(d), and 2(g)): the first system consists of a complex conjugate pole pair at an angular frequency ω1, the second system has a complex conjugate pole pair at a nearby angular frequency ω2, and the third has two complex conjugate pole pairs, one at ω1 and the other at ω2. From the magnitude spectra of these three systems (Figures 2(b), 2(e) and 2(h)), it is observed that even though the peaks in Figures 2(b) and 2(e) are clearly visible, in the system where the two pole pairs are combined, the peaks are not resolved well, as shown in Figure 2(h). This is due to the multiplicative property of the magnitude spectra. But from Figures 2(c), 2(f) and 2(i), it is evident that in the group delay spectrum obtained by combining the pole pairs, the peaks are well resolved, as shown in Figure 2(i).

Fig. 2. Resolving power of the group delay spectrum: z-plane plot, magnitude spectrum and group delay spectrum for (I) a pole pair inside the unit circle at ω1, (II) a pole pair inside the unit circle at ω2, and (III) pole pairs at both ω1 and ω2 inside the unit circle.

Further, in the group delay spectrum of any signal, the peaks (poles) and valleys (zeros) will be resolved properly only when the signal is a minimum phase signal. In our work, since the signal is derived from a positive function (which is similar to a magnitude spectrum), it can be shown that the resultant signal is a minimum phase signal. We have exploited the minimum phase property of the signal derived from any positive function and the additive property of the group delay function to segment the speech into syllable-like entities.
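The additive property can be checked numerically. The sketch below computes the group delay via the standard DFT identity τ(ω) = Re[Y(ω)/X(ω)], where Y is the DFT of n·x(n); the function name, FFT size, and regularizing epsilon are our own choices:

```python
import numpy as np

def group_delay(x, n_fft=512):
    """Group delay of a finite sequence, via the DFT identity
    tau(w) = Re[ DFT(n * x(n)) / DFT(x(n)) ]."""
    x = np.asarray(x, dtype=float)
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(np.arange(len(x)) * x, n_fft)
    # Small epsilon guards against division by near-zero magnitudes.
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
```

Because the DFT of a convolution is the product of the DFTs, the group delay of x1 convolved with x2 equals the sum of the individual group delays, sample for sample, which is exactly the additive property exploited above.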

A. The Minimum phase property of the magnitude spectrum
Consider the system function X(z) given below:

    X(z) = 1 / (1 - z_a z^{-1})                         (4)

The square of the magnitude of the system frequency response is given by

    |X(e^jω)|² = X(e^jω) X^c(e^jω)                      (5)

where 'c' denotes complex conjugation. In the z-domain, the corresponding function is

    C(z) = X(z) X^c(1/z^c)                              (6)

From Equation 6, we can infer that for every pole of X(z) at z_a, there is a pair of poles in C(z), at z_a and at 1/z_a^c. Consequently, if one element of each pair is outside the unit circle, then its conjugate reciprocal will lie inside the unit circle [18]. Since the Fourier transform of Equation 6 exists, the inverse z-transform of Equation 6 leads to

    c(n) = x(n) ⊛ x^c(-n)                               (7)

where ⊛ denotes convolution, i.e.,

    c(n) = Σ_m x(m) x^c(m - n)                          (8)

From Equation 8, we conclude that the causal portion of the inverse Fourier transform of the squared magnitude spectrum of a signal whose root is at z_a or 1/z_a^c, with |z_a| < 1, will have its root at z_a, i.e., the resultant signal will always be a minimum phase signal. But, since a window is applied in the cepstral domain, the root cepstrum is of finite length. Because of this, the z-transform of the signal will have spurious zeros. These zeros may affect the positions of the actual zeros present in the signal. To overcome this problem, the squared magnitude spectrum can be inverted (1/|X(e^jω)|²) and another minimum phase signal can be derived using the same algorithm, if zeros are of interest.
Instead of taking the squared magnitude spectrum, we can in fact take |X(e^jω)|^{2γ}, where γ can be any value¹. If the signal x(n) is an energy-bounded signal, then from the Akhiezer-Krein and Fejér-Riesz theorems [19] it can be shown that

    |X(e^jω)|^{2γ} = FT( y(n) ⊛ y^c(-n) )               (9)

where 'c' and ⊛ denote the complex conjugation and convolution operations, respectively. Thus |X(e^jω)|^{2γ} can be expressed as the Fourier transform of the autocorrelation of some sequence y(n). Basically, the root cepstrum of any signal x(n) can be thought of as the autocorrelation of some other sequence y(n).

B. Algorithm for segmentation
In [17], [20] it is shown that if a signal is minimum phase, its group delay function resolves the peaks and valleys of the spectrum well. If the STE function is thought of as a magnitude spectrum, an equivalent minimum phase signal can be derived, as explained in Section III-A. The peaks and valleys of the group delay function of this signal then correspond to the peaks and valleys in the STE function. In the STE function of any syllable, the energy is quite high in the voiced region and tapers off at both ends, where a consonant may be present, which results in local energy fluctuations. If these local variations are smoothed, then the minima at both ends of a voiced region correspond to syllable boundaries. The algorithm for segmentation of continuous speech using this approach, which essentially smoothes the energy contour and removes the local energy fluctuations, is given below.

¹Other values of γ are especially useful in formant and antiformant extraction from the speech signal when the dynamic range is very high.
Let x(n) be the given digitized speech signal (Figure 3(a)).
1) Compute the short-term energy (STE) function of x(n).
2) Since syllable boundaries correspond to minima of the STE function, invert the symmetrized STE function so that the minima appear as peaks, and treat the result as the magnitude spectrum of an arbitrary signal.
3) Compute the IDFT of this assumed magnitude spectrum and retain its causal portion. Let the size of the window applied on this causal sequence, i.e., the size of the cepstral lifter, be N_c.
4) Compute the minimum phase group delay function of the windowed causal sequence ([20], [17]). Let this sequence be τ_min(k).

Detect the positive peaks in the minimum phase group delay function τ_min(k) as given below: if τ_min(k) is positive, and if τ_min(k) > τ_min(k-1) and τ_min(k) > τ_min(k+1), then τ_min(k) is considered a peak. These peaks approximately correspond to the syllable boundaries.
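The peak-picking rule amounts to a three-point local-maximum test restricted to positive values; this is our reading of the rule, with the function name our own:

```python
def positive_peaks(tau):
    """Indices k where tau[k] > 0 and tau[k] is a strict local
    maximum, i.e. tau[k-1] < tau[k] > tau[k+1]."""
    return [k for k in range(1, len(tau) - 1)
            if tau[k] > 0 and tau[k] > tau[k - 1] and tau[k] > tau[k + 1]]
```

Applied to the minimum phase group delay function of the inverted STE contour, these indices are the candidate syllable boundaries.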
As explained in Section II, for a given speech signal x(n) (Figure 4(a)), the group delay function may be derived in three different ways. The group delay function shown in Figure 4(b) is derived using the root cepstrum based approach. The group delay functions derived using the other two methods, i.e., the cepstrum and cepstrum-LP based smoothing methods, are given in Figures 4(c) and (d), along with the group delay function derived using root cepstrum based smoothing. Interestingly, all three group delay functions are almost identical, except for slight shifts in boundary locations in the case of cepstrum-LP based smoothing. But each method has its own advantages and disadvantages. In the cepstrum and root cepstrum based smoothing, the group delay functions are exactly similar in shape, but the computation of the conventional cepstrum requires a log operation. The common problem with these two methods is the choice of the cepstral lifter size (N_c). Appropriate choices for this parameter are discussed in the next section. If the cepstrum-LP based method is used, the cepstral lifter size is not crucial, and in fact the whole causal portion of the cepstrum can be considered for prediction. Even though this seems very attractive, this method suffers from the fact that the choice of the predictor order is related to the number of boundaries.

C. Choice of N_c
The frequency resolution in the magnitude spectrum, as well as in the group delay spectrum, depends on the size of the cepstral lifter (N_c) applied to the root cepstrum. Here, N_c is defined as

    N_c = (length of the STE function) / WSF            (12)

In Equation 12, the length of the STE function corresponds to the number of samples in the STE function, and the window scale factor (WSF) represents a scaling factor which is used to truncate the cepstrum. In this context, the value of the WSF is always greater than 1. If N_c is high, the resolution will also be high, i.e., two closely spaced boundaries can be resolved.
If N_c is chosen to be too high, a boundary will appear within a CV or CVC syllable, at the CV transition. For syllable segmentation, this is undesirable. On the other hand, if the resolution is too low, even syllables will not be resolved, which is also not desirable. To choose N_c appropriately, a durational analysis was performed. For this analysis, about 5000 speech dialogs of the Switchboard data [14] were considered. Table I gives the durations of a subset of syllables in the Switchboard data. The analysis shows that when the WSF is varied from 4 to 10, the number of syllable boundaries detected is equal to the number of actual boundaries. Based on this experiment, the window scale factor in the computation of N_c can be set between 4 and 10. The number of samples in the STE function is directly related to the number of syllables present in the speech signal. In a few instances, the syllable duration may be more than 300 ms or less than 100 ms. If the syllable duration is more than 300 ms, that particular segment may be split into two segments. Similarly, if the syllable duration is less than 100 ms, there is a chance that the syllable boundary is not resolved. Most importantly, however, the other syllable boundaries remain unaffected.
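Equation 12 reduces to a one-line computation; the function name is ours, and the assertion simply encodes the stated constraint that WSF > 1:

```python
def cepstral_lifter_size(ste_length, wsf=4.0):
    """Cepstral lifter size N_c = (length of STE function) / WSF.

    WSF must exceed 1; the durational analysis on Switchboard
    suggests values between 4 and 10.
    """
    assert wsf > 1.0, "window scale factor must be greater than 1"
    return int(ste_length / wsf)
```

With the default WSF of 4, a 400-sample STE function gives a lifter of 100 cepstral samples; larger WSF values give a shorter lifter and hence lower boundary resolution.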

IV. SILENCES, FRICATIVES, AND SEMI-VOWELS
The group delay function resolves even very closely spaced poles well when they are separated by a zero, provided the zero is located at approximately the same radius as that of the poles. In other cases, there may be some degradation in performance. Three possible places where failure may occur are (i) silence regions, where the duration of the silence is considerable, (ii) fricative segments, where the energy of the fricative is quite high, and (iii) semivowels, when they occur in the middle of a word. To overcome these problems, on the advice of Steven Greenberg at ICSI [21], a sub-band based approach to syllable segmentation is attempted.

A. Presence of long-silences
In this approach, since the symmetrized energy contour is inverted, any drastic energy reduction between two syllables appears as a pole in the z-domain and a positive peak in the group delay domain. But for a long silence between two syllables (say, more than about 30 ms; see Figure 5(a)), this rule may not apply. Instead, we may get more than one boundary in the group delay domain, depending upon the resolution (Figure 5(b)). Syllable boundaries correspond to poles in the group delay domain, and a long silence is equivalent to having two or more consecutive poles with identical radii. To overcome this problem, silence segments whose duration exceeds about 30 ms should be removed from the continuous speech. Based on the energy, zero-crossing rate, and spectral flatness of a frame, a decision is made as to whether that frame is silence or speech. If the duration of a silence is more than 30 ms, that particular segment is removed from the signal (see Figure 5(c)) before further processing. The resultant peaks in the group delay spectrum then correspond to correct segment boundaries. This process reduces the spurious segment boundaries (Figure 5(d)).
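The silence-removal step can be sketched as below. The paper's silence/speech decision uses energy, zero-crossing rate, and spectral flatness; this sketch uses frame energy alone, and the energy threshold is a placeholder, so only the overall run-dropping logic should be read as representative:

```python
import numpy as np

def remove_long_silences(x, fs, frame_ms=10, max_sil_ms=30,
                         energy_thresh=1e-4):
    """Drop runs of consecutive low-energy frames whose total
    duration exceeds max_sil_ms; shorter pauses are kept."""
    flen = int(fs * frame_ms / 1000)
    out, run = [], []
    for i in range(len(x) // flen):
        f = x[i * flen:(i + 1) * flen]
        if np.mean(f ** 2) < energy_thresh:
            run.append(f)                      # accumulate a silence run
        else:
            if len(run) * frame_ms <= max_sil_ms:
                out.extend(run)                # keep a short pause
            run = []
            out.append(f)                      # always keep speech
    if len(run) * frame_ms <= max_sil_ms:
        out.extend(run)
    return np.concatenate(out) if out else x[:0]
```

After this pre-processing, a long inter-syllable silence collapses to a single energy dip, so the group delay function yields one boundary instead of several.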

B. Presence of fricatives
If a fricative is present in the speech signal x(n) (Figure 6(a)), a boundary will be generated at the middle of the fricative when the energy function is computed. This manifests in the group delay domain as a spurious peak (see the 3rd and 4th peaks in Figure 6(b)). To avoid this, the signal x(n) is low-pass filtered to remove the high-frequency fricative energy. Observe that the energy of the signal in the fricative regions is significantly reduced (Figure 6(c)). Consequently, in the group delay spectrum too, the spurious peak/boundary is removed (Figure 6(d)). However, low-pass filtering also causes the segment boundaries to shift slightly, so the group delay function derived from the low-pass filtered signal should not be used as the reference. Nevertheless, it can be used to remove peaks due to fricatives in the original group delay function (Figure 6(d)).

C. Presence of a semivowel
The semivowels are very similar to vowels in that they have periodic, intense waveforms with most of their energy in the low formants. Even though they are slightly weaker than vowels, when they occur in the middle of a word in continuous speech, in most cases a visible energy reduction may not be perceived (see Figure 7(a)). Because of this, in the group delay spectrum too, we may not get a boundary between two vowels when they are separated by a semivowel (see the three vertical lines drawn in Figure 7 and the intersecting points (1), (2) and (3) in Figure 7(b)). For example, in the word envelope, since there is no significant energy reduction between the syllables /ve/ and /lope/, the corresponding peak is absent in the group delay spectrum (see the intersecting point (1) in Figure 7(b)). If a suitable band-pass filter is applied to the original signal, then, since the energy of the semivowels is concentrated in the low formants, the semivowels will be attenuated severely (see Figure 7(c)) without affecting the vowel regions much. This ensures that a boundary will be present at the semivowel segment also (see the points/peaks (1), (2) and (3) in Figure 7(d)).
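The low-pass and band-pass filtering used in Sections IV-B and IV-C can be sketched with a standard Butterworth design. The cut-off frequencies (500 Hz for the low-pass band, 500 to 1500 Hz for the band-pass band) are from the text; the filter order (4) and function name are our own assumptions:

```python
import numpy as np
from scipy.signal import butter, lfilter

def three_band_signals(x, fs):
    """Return the all-pass (original), low-pass (fc = 500 Hz), and
    band-pass (500-1500 Hz) versions of the signal."""
    nyq = fs / 2.0
    b_lp, a_lp = butter(4, 500.0 / nyq, btype='low')
    b_bp, a_bp = butter(4, [500.0 / nyq, 1500.0 / nyq], btype='band')
    return x, lfilter(b_lp, a_lp, x), lfilter(b_bp, a_bp, x)
```

The STE and group delay functions are then computed from each of the three outputs; the low-pass output suppresses fricative energy, while the band-pass output suppresses semivowel energy relative to the surrounding vowels.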

D. Refining segment boundaries
The boundaries derived from the group delay based algorithm may have slight deviations from the actual boundaries, for example, in the nasal consonant regions (Figure 8(a) and (b)). This is due to the lower resolution. If the resolution of the group delay spectrum is increased by increasing the cepstral lifter size N_c applied in the cepstral domain, a spurious segment is observed at the beginning of a nasal consonant. Nevertheless, when the resolution is increased, the error in the segment boundary is small (Figure 8(c)). Each boundary location in the lower resolution group delay spectrum is therefore compared with all the peaks in the higher resolution group delay spectrum, and the nearest peak is considered as the actual segment boundary.
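The refinement step is a nearest-neighbour snap from the coarse boundaries to the high-resolution peaks; the function name is ours:

```python
import numpy as np

def refine_boundaries(coarse, fine):
    """Snap each lower-resolution boundary to the nearest peak
    location taken from the higher-resolution group delay function."""
    fine = np.asarray(fine)
    return [int(fine[np.argmin(np.abs(fine - b))]) for b in coarse]
```

This keeps the reliable detection behaviour of the low-resolution function while inheriting the small boundary error of the high-resolution one.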

E. Combining evidence
Instead of using the group delay function derived from the STE function of the original signal alone, the speech signal is passed through a bank of three filters, and the group delay functions of the outputs of each of these three filters are computed. The basic steps involved in this approach for segmenting the speech signal into syllable-like units are given in the block diagram (Figure 9). The boundaries derived from the different group delay functions are combined using the following logic.
In these equations, ∨ represents the logical 'OR' operation, and the difference operation considers only the magnitude of the time difference between boundary locations. For example, the speech signal for the utterance "group delay based segmentation" (Figure 10(a)) is considered to describe the method of combining evidence. First, the silences between the syllables, if any, are removed. In Figure 10, the solid vertical lines drawn between Figures 10(b) and (c) denote the segment boundaries detected after combining the evidence from the group delay functions of the all-pass and low-pass filtered speech signals using Equation 13. The dashed line between Figures 10(b) and (c) (labeled "1") denotes the spurious boundary, which is removed after combining. The solid vertical line drawn between Figures 10(b) and (d) denotes the new boundary detected after combining the evidence from the group delay functions of the all-pass and band-pass filtered speech signals using Equations 14 and 15. The dotted vertical lines drawn from Figure 10(e) to Figure 10(a) denote the boundaries detected after refinement (see Equation 16) using the higher resolution group delay function derived from the all-pass filtered signal. Observe that a spurious segment boundary produced at the fricative region is removed after low-pass filtering the signal, and a new boundary is detected (as indicated by the solid vertical line with label "2") between the syllables /de/ and /lay/ because of band-pass filtering the signal.
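One plausible reading of this combination logic can be sketched as below. The paper's Equations 13 to 16 define the exact rule; here the tolerance value and all names are placeholders of our own, and only the overall structure (confirm with low-pass evidence, augment with band-pass evidence) is taken from the text:

```python
def combine_evidence(b_ap, b_lp, b_bp, tol=50):
    """Combine boundaries from the all-pass (b_ap), low-pass (b_lp),
    and band-pass (b_bp) group delay functions.

    An all-pass boundary is kept only if the low-pass evidence agrees
    within `tol` (removing fricative-induced boundaries); band-pass
    boundaries with no nearby counterpart are then added (recovering
    boundaries missed at semivowels).
    """
    def near(b, ref):
        return any(abs(b - r) <= tol for r in ref)
    kept = [b for b in b_ap if near(b, b_lp)]
    extra = [b for b in b_bp if not near(b, kept)]
    return sorted(kept + extra)
```

In the running example, a spurious all-pass boundary inside a fricative has no low-pass counterpart and is dropped, while a band-pass boundary at a semivowel has no all-pass counterpart and is added.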
Fig. 10. (a) Speech signal for the utterance "group delay based segmentation" (/group/ /de/ /lay/ /based/ /seg/ /men/ /tesh/ /an/) (b) Group delay function derived from the all-pass filtered signal (c) Group delay function derived from the low-pass filtered (f_c = 500 Hz) signal (d) Group delay function derived from the band-pass filtered (f_l = 500 Hz and f_h = 1500 Hz) signal (e) Group delay function with higher resolution derived from the all-pass filtered signal.

V. PERFORMANCE EVALUATION

A. Speech corpora
The Switchboard corpus [14] and the OGI-MLTS corpus [22] are used for analyzing the performance of our system. Switchboard is a corpus of several thousand informal speech dialogs recorded over the telephone. For our analysis, a portion of the corpus for which syllable level transcription [13] is provided is considered. The duration of these speech signals varies from 0.5 s to 25 s. From the OGI-MLTS corpus, 40 speech files uttered by 40 different speakers are considered for the analysis. In this subset, each file is of 45 s duration. These files were manually segmented into syllabic units and used as a reference to verify the performance of our segmentation approach.

B. Experimental setup
Prior to automatic segmentation, the speech signals are first pre-processed by removing the long silences (if any), as explained in Section IV-A. For the computation of the short-term energy function, overlapped rectangular windows are used, where the window length is 20 ms and the overlap is 10 ms. Further, the value of γ in |E(e^jω)|^γ is set to 0.001 to reduce the dynamic range of the short-term energy function, irrespective of the speech corpus considered. In fact, any sufficiently small value of γ has been found to be appropriate.
As defined in Section III-C, the window scale factor (WSF) used to compute the size of the Hanning window (cepstral lifter size N_c) is set to 4.0. Since the value of the WSF is fixed, the length of the root cepstrum is proportional to the length of the STE function. Three different group delay functions are computed from (a) the original speech signal (all-pass filtered), (b) the low-pass filtered (f_c = 500 Hz) speech signal, and (c) the band-pass filtered (f_l = 500 Hz and f_h = 1500 Hz) speech signal. The evidence derived from these group delay functions is combined as explained in Section IV-E. In order to see the effect of each of the group delay functions on the performance of the final system, four different experiments are carried out separately on the Switchboard corpus and the results are tabulated (see Table II). In all these experiments, a boundary is said to be detected if the error between an automatic segmentation boundary and the corresponding manual segmentation boundary is less than 80 ms. Based on the error, four different categories are observed (see the 1st column of Table II). In each of these four categories, the performance (in %) is calculated as the ratio between the number of boundaries in that category and the total number of automatically detected boundaries. When the evidence from the different group delay functions is combined, the number of insertions and deletions is reduced, and the error in the segmentation boundaries is found to be greatly reduced (see Table II). The performance of the final system on the Switchboard data is compared with the performance on the OGI data (see Table III). The performance of the final system on the Tamil data of the OGI corpus is found to be better than that on the Switchboard corpus. The better performance for Tamil may be due to its simple syllable structure.
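The detection criterion used in these experiments can be sketched as a simple tolerance match; the 80 ms tolerance is from the text, while the function name and the matching policy (any manual boundary within tolerance counts) are our own reading:

```python
def boundary_accuracy(auto, manual, tol_ms=80):
    """Fraction of automatically detected boundaries that lie within
    tol_ms of some manually marked boundary (times in ms)."""
    if not auto:
        return 0.0
    hits = sum(1 for a in auto
               if any(abs(a - m) <= tol_ms for m in manual))
    return hits / len(auto)
```

Tightening tol_ms to 25 ms gives the stricter figure quoted in the abstract (67% on Switchboard, 76.6% on OGI-MLTS).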

Fig. 4. Group delay based segmentation -an example (a) Speech signal (b) Group delay function derived from root-cepstrum (c) Group delay function derived from cepstrum-LP (d) Group delay function derived from conventional cepstrum.

Fig. 5. (a) Speech signal for the alphanumeric string '1258abdg' with silence (b) Group delay function derived from signal given in (a) (c) Speech signal after removing long silences (d) Group delay function derived from signal given in (c).

Fig. 6. (a) Speech signal (b) Group delay function derived from the signal given in (a) (c) Low-pass filtered (f_c = 500 Hz) signal given in (a) (d) Group delay function derived from the signal given in (c).

Fig. 7. (a) Speech signal (b) Group delay function derived from the signal given in (a) (c) Band-pass filtered (f_l = 500 Hz and f_h = 1500 Hz) signal given in (a) (d) Group delay function derived from the signal given in (c).

Fig. 8. (a) Speech signal for the utterance of the digit string '1 2 9 10' (b) Group delay function derived from the signal given in (a) with lower resolution (WSF = 2.5) (c) Group delay function derived from the signal given in (a) with higher resolution (WSF = 1.2).

When the group delay function derived from the all-pass filtered signal alone is used, the number of insertions and deletions is very high (see the 5th and 6th rows of the 2nd column of Table II). The number of insertions is considerably reduced when the evidence from the all-pass and low-pass filtered signals is combined (see the 3rd column of Table II), and additional boundaries are detected when the evidence from the band-pass filtered signal is also combined (see the 5th column of Table II).