EURASIP Journal on Applied Signal Processing 2003:10, 1016–1026
© 2003 Hindawi Publishing Corporation

Model-Based Speech Signal Coding Using Optimized Temporal Decomposition for Storage and Broadcasting Applications

A dynamic programming-based optimization strategy for a temporal decomposition (TD) model of speech and its application to low-rate speech coding in storage and broadcasting is presented. In previous work with the spectral stability-based event localizing (SBEL) TD algorithm, the event localization was performed based on a spectral stability criterion. Although this approach gave reasonably good results, there was no assurance of the optimality of the event locations. In the present work, we have optimized the event localizing task using a dynamic programming-based optimization strategy. Simulation results show that improved TD model accuracy can be achieved. A methodology for incorporating the optimized TD algorithm within the standard MELP speech coder for the efficient compression of speech spectral information is also presented. The performance evaluation results revealed that the proposed speech coding scheme achieves 50%–60% compression of speech spectral information with negligible degradation in the decoded speech quality.


INTRODUCTION
While practical issues such as delay, complexity, and fixed rate of encoding are important for speech coding applications in telecommunications, they can be significantly relaxed for speech storage applications such as store-forward messaging and broadcasting systems. In this context, it is desirable to know what optimal compression performance is achievable if the associated constraints are relaxed. Various techniques for compressing speech information exploiting the delay domain, for applications where delay does not need to be strictly constrained (in contrast to full-duplex conversational communication), are found in the literature [1,2,3,4,5]. However, only very few have addressed the issue from an optimization perspective. Specifically, temporal decomposition (TD) [6,7,8,9,10,11], which is very effective in representing the temporal structure of speech and in removing temporal redundancies, has not been given adequate treatment for optimal performance to be achieved. Such an optimized TD (OTD) algorithm would be useful for speech coding applications such as voice store-forward messaging systems, multimedia voice-output systems, and broadcasting via the Internet. Not only would it be useful for speech coding in its own right, but research in this direction would also lead to a better understanding of the structural properties of the speech signal and to the development of improved speech models which, in turn, would result in improvements in audio processing systems in general.
TD of speech [6,7,8,9,10,11] has recently emerged as a promising technique for analyzing the temporal structure of speech. TD is a technique of modelling the speech parameter trajectory in terms of a sequence of target parameters (event targets) and an associated set of interpolation functions (event functions). TD can also be considered as an effective technique for decorrelating the inherent interframe correlations present in any frame-based parametric representation of speech. TD model parameters are normally evaluated over a buffered block of speech parameter frames, with the block size generally limited by the computational complexity of the TD analysis process over long blocks. Let y_i(n) be the ith speech parameter at the nth frame location. The speech parameters can be any suitable parametric representation of the speech spectrum such as reflection coefficients, log area ratios, and line spectral frequencies (LSFs). It is assumed that the parameters have been evaluated at close enough frame intervals to represent accurately even the fastest of speech transitions. The index i varies from 1 to I, where I is the total number of parameters per frame. The index n varies from 1 to N, where n = 1 and n = N are the indices of the first and last frames of the speech parameter block buffered for TD analysis. In the TD model of speech, each speech parameter trajectory, y_i(n), is described as

ŷ_i(n) = Σ_{k=1}^{K} a_ik φ_k(n), 1 ≤ n ≤ N, (1)

where ŷ_i(n) is the approximation of y_i(n) produced by the TD model. The variable φ_k(n) is the amplitude of the kth event function at the frame location n, and a_ik is the contribution of the kth event function to the ith speech parameter. The value K is the total number of speech events within the speech block with frame indices 1 ≤ n ≤ N. It should be noted that the event functions φ_k(n) are common to all speech parameter trajectories (y_i(n), 1 ≤ i ≤ I) and therefore provide a compact and approximate representation, that is, a model, of speech.
Equation (1) can be expressed in vector notation as

ŷ(n) = Σ_{k=1}^{K} a_k φ_k(n), (2)

where a_k = [a_1k a_2k ··· a_Ik]^T is the kth event target vector, and ŷ(n) is the approximation of y(n), the nth speech parameter vector, produced by the TD model of speech. Note that φ_k(n) remains a scalar since it is common to each of the individual parameter trajectories. In matrix notation, (2) can be written as

Ŷ = AΦ, (3)

where the kth column of matrix A contains the kth event target vector, a_k, and the nth column of the matrix Ŷ (approximation of Y) contains the nth speech parameter frame, ŷ(n), produced by the TD model. Matrix Y contains the original speech parameter block. In the matrix Φ, the kth row contains the kth event function, φ_k(n). It is assumed that the functions φ_k(n) are ordered with respect to their locations in time; that is, the function φ_{k+1}(n) occurs later than the function φ_k(n). Each φ_k(n) is supposed to correspond to a particular speech event. Since a speech event lasts for a short time (hence "temporal"), each φ_k(n) should be nonzero only over a small range of n. Event function overlapping normally occurs between events that are close in time, while events that are far apart in time do not overlap at all. These characteristics ensure that Φ is a sparse matrix, with the number of nonzero terms in the nth column indicating the number of event functions overlapping at the nth frame location [6]. Thus, significant coding gains can be achieved by encoding the information in the matrices A and Φ instead of the original speech parameter matrix Y [6,11,12]. The results of the spectral stability-based event localizing (SBEL) TD [9,10] and Atal's original algorithm [6] for TD analysis show that event function overlapping beyond two adjacent event functions occurs very rarely, although in the generalized TD model overlapping is allowed to any extent. Taking this into account, the proposed modified model of TD imposes a natural limit on the length of the event functions.
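The matrix form Ŷ = AΦ and the sparsity of Φ can be sketched numerically. The following is a minimal illustration, not code from the paper: the block dimensions, event centers, and triangular event-function shapes are all assumptions chosen so that, as in the modified model, at most two adjacent event functions overlap at any frame.

```python
import numpy as np

# Illustrative TD reconstruction Y_hat = A @ Phi (equation (3)).
I, N, K = 10, 20, 5          # parameters/frame, frames/block, events/block

rng = np.random.default_rng(0)
A = rng.random((I, K))       # event targets: k-th column is a_k
Phi = np.zeros((K, N))       # event functions: k-th row is phi_k(n)

# Hypothetical event centers; each event function is nonzero only between
# the centers of its neighbouring events (triangular interpolation shape).
centers = [1, 5, 9, 13, 17]
for k, c in enumerate(centers):
    left = centers[k - 1] if k > 0 else 0
    right = centers[k + 1] if k < K - 1 else N - 1
    for n in range(left, right + 1):
        span = (c - left) if n <= c else (right - c)
        Phi[k, n] = 1.0 - abs(n - c) / max(span, 1)

Y_hat = A @ Phi              # reconstructed speech parameter block
# Phi is sparse: at most two event functions overlap at any frame
print(Y_hat.shape, int(np.count_nonzero(Phi, axis=0).max()))
```

Encoding A (I × K) and the sparse Φ instead of the full I × N block Y is the source of the coding gain discussed above.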
We have shown that better performance can be achieved through optimization of the modified TD model. In previous TD algorithms such as SBEL TD [9,10] and Atal's original algorithm [6], event locations are determined using heuristic assumptions. In contrast, the proposed OTD analysis technique makes no a priori assumptions on event locations. All TD components are evaluated based on error-minimizing criteria, using a joint optimization procedure. The mixed-excitation LPC vocoder model used in the standard MELP coder was used as the baseline parametric representation of the speech signal. Application of OTD to the efficient compression of MELP spectral parameters is also investigated, together with TD parameter quantization issues and effective coupling between the TD analysis and parameter quantization stages. We propose a new OTD-based LPC vocoder with a detailed coder performance evaluation in terms of both objective and subjective measures. This paper is organized as follows. Section 2 introduces the modified TD model. An optimal TD parameter evaluation strategy based on the modified TD model is presented in Section 3. Section 4 gives numerical results with OTD. The details of the proposed OTD-based vocoder and its performance evaluation results are reported in Sections 5 and 6, respectively. The concluding remarks are given in Section 7.

MODIFIED TD MODEL OF SPEECH
The proposed modified TD model of speech restricts the event function overlapping to only two adjacent event functions as shown in Figure 1.

Figure 1: Modified temporal decomposition model of speech. The speech parameter segment n_k ≤ n < n_{k+1} is represented by a weighted sum (with weights φ_k(n) and φ_{k+1}(n) forming the event functions) of the two vectors a_k and a_{k+1} (event targets). Vertical lines depict the speech parameter vector sequence.

This modified model of TD can be described as

ŷ(n) = a_k φ_k(n) + a_{k+1} φ_{k+1}(n), n_k ≤ n < n_{k+1}, (5)
where n_k and n_{k+1} are the locations of the kth and (k+1)th events, respectively. All speech parameter frames between the consecutive event locations n_k and n_{k+1} are described by these two events. Equivalently, the modified TD model can be expressed as

ŷ(n) = Σ_{k=1}^{K} a_k φ_k(n), (6)

where φ_k(n) = 0 for n < n_{k−1} and n ≥ n_{k+1}. In the modified TD model, each event function is allowed to be nonzero only in the region between the centers of the preceding and succeeding events. This eliminates the computational overhead associated with achieving the time-limited property of events in the previous TD algorithms [6,9,10]. The modified TD model can be considered as a hybrid between the original TD concept [6] and the speech segment representation techniques proposed in [1]. In [1], a speech parameter segment between two locations n_k and n_{k+1} is simply represented by a constant vector (the centroid of the segment) or by a first-order (linear) approximation. A constant vector approximation of the form

ŷ(n) = ( Σ_{m=n_k}^{n_{k+1}−1} y(m) ) / (n_{k+1} − n_k), for n_k ≤ n < n_{k+1},

provides a single vector representation for a whole speech segment. However, this representation requires the segments to be short in length in order to achieve a good speech parameter representation accuracy. A linear approximation of the form ŷ(n) = na + b requires two vectors (a and b) to represent a segment of speech parameters. This segment representation technique captures linearly varying speech segments well and is similar to the linear interpolation technique reported in [13]. The proposed modified model of TD in (5) provides a further extension to speech segment representation, where each speech parameter vector y(n) is described as the weighted sum of two vectors a_k and a_{k+1}, for n_k ≤ n < n_{k+1}. The weights φ_k(n) and φ_{k+1}(n) for the nth speech parameter frame form the event functions of the traditional TD model [6].
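The segment representations above can be contrasted on synthetic data. This is an illustrative sketch, not the paper's experiment: the segment boundaries and random frames are assumptions, and the per-frame weights are found by least squares, so each frame's weight pair plays the role of the event-function samples φ_k(n) and φ_{k+1}(n).

```python
import numpy as np

# Compare (a) the constant-vector (centroid) segment representation of [1]
# with (b) the modified TD weighted sum of the two boundary targets.
rng = np.random.default_rng(1)
nk, nk1 = 4, 10                         # segment boundaries [nk, nk1)
Y = rng.random((10, 16))                # block of 10-dim parameter frames
seg = Y[:, nk:nk1]                      # segment y(n), nk <= n < nk1

# (a) constant-vector representation: centroid of the segment
centroid = seg.mean(axis=1, keepdims=True)
err_const = float(np.sum((seg - centroid) ** 2))

# (b) modified TD: y_hat(n) = a_k*phi_k(n) + a_{k+1}*phi_{k+1}(n),
# with initial targets a_k = y(n_k), a_{k+1} = y(n_{k+1})
a_k, a_k1 = Y[:, nk], Y[:, nk1]
A = np.column_stack([a_k, a_k1])        # 10 x 2 target matrix
Phi, *_ = np.linalg.lstsq(A, seg, rcond=None)   # per-frame weights
err_td = float(np.sum((seg - A @ Phi) ** 2))

print(err_const, err_td)
```

Note that the first frame of the segment equals a_k itself, so the modified TD representation reconstructs it exactly with weights (1, 0).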
As shown in the following sections, the simplicity of the proposed modified TD model allows the optimal evaluation of the model parameters and thus results in improved modelling accuracy.

Figure 2: Buffering of speech parameters into blocks is a preprocessing stage required for TD analysis. TD analysis is performed on a block-by-block basis, with TD parameters calculated for each block separately and independently.

OPTIMAL ANALYSIS STRATEGY
This section describes the details of the optimization procedure involved with the evaluation of the TD model parameters based on the proposed modified model of TD described in Section 2.

Speech parameter buffering
TD is a speech analysis modelling technique which can take advantage of the relaxation in the delay constraint for speech signal coding. TD generally requires speech parameters to be buffered over long blocks for processing, as shown in Figure 2. Although the block length is not fundamentally limited by the speech storage application under consideration, the computational complexity associated with processing long speech parameter blocks imposes a practical limit on the block size, N. The total set of speech parameters, y(n), where 1 ≤ n ≤ N, buffered for TD analysis is termed a block (see Figure 3). The series of speech parameters, y(n), where n_k ≤ n < n_{k+1}, is termed a segment. TD analysis is normally performed on a block-by-block basis, and for each block, the event locations, event targets, and event functions are optimally evaluated. For optimal performance, a buffering technique with overlapping blocks is required to ensure a smooth transition of events at the block boundaries. Sections 3.2 through 3.5 give the details of the proposed optimization strategy for a single block analysis. Details of the overlapping buffering technique for improved performance are given in Section 3.6.

Event function evaluation
The proposed optimization strategy for the modified TD model of speech has the key feature of determining the optimum event locations from all possible event locations. This guarantees the optimality of the technique with respect to the modified TD model. Given a candidate set of locations, {n_1, n_2, ..., n_K}, for the events, the event functions are determined using an analytical optimization procedure. Since the modified TD model of speech considered for optimization places an inherent limit on event function length, the event functions can be evaluated in a piece-wise manner. In other words, the parts of event functions between the centers of consecutive events can be calculated separately as described below. The remainder of this section describes the computational details of this optimum event function evaluation task. Assume the locations n_k and n_{k+1} of two consecutive events are known. Then, the right half of the kth event function and the left half of the (k+1)th event function can be optimally evaluated by using a_k = y(n_k) and a_{k+1} = y(n_{k+1}) as initial approximations for the event targets. The initial approximations of the event targets are later iteratively refined as described in Section 3.5. The reconstruction error, E(n), for the nth speech parameter frame is given by

E(n) = || y(n) − a_k φ_k(n) − a_{k+1} φ_{k+1}(n) ||², (7)

where n_k ≤ n < n_{k+1}. By minimizing E(n) with respect to φ_k(n) and φ_{k+1}(n), that is, by setting the partial derivatives of E(n) with respect to the two weights to zero, we obtain the pair of normal equations

(a_k^T a_k) φ_k(n) + (a_k^T a_{k+1}) φ_{k+1}(n) = a_k^T y(n),
(a_k^T a_{k+1}) φ_k(n) + (a_{k+1}^T a_{k+1}) φ_{k+1}(n) = a_{k+1}^T y(n), (8)

where n_k ≤ n < n_{k+1}. Therefore, the modelling error, E(n), for each spectral parameter frame, y(n), in a segment can be evaluated by using (7) and (8). The total accumulated error, E_seg(n_k, n_{k+1}), for a segment becomes

E_seg(n_k, n_{k+1}) = Σ_{n=n_k}^{n_{k+1}−1} E(n). (9)

Therefore, given the event locations n_1, n_2, ..., n_K for a parameter block, 1 ≤ n ≤ N, the total accumulated error for the block can be calculated as

E_block(n_1, n_2, ..., n_K) = Σ_{k=0}^{K} E_seg(n_k, n_{k+1}), (10)

where n_0 = 0, n_{K+1} = N + 1, and E(0) = 0.
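The per-frame minimization described above reduces to a 2 × 2 linear system for each frame of a segment. The sketch below, with hypothetical data and initial targets taken from the boundary frames as in the text, solves that system and accumulates the segment error:

```python
import numpy as np

# Per-frame least-squares solution for (phi_k(n), phi_{k+1}(n)) and the
# accumulated segment error E_seg(n_k, n_{k+1}); an illustrative sketch.
def segment_error(Y, nk, nk1, a_k, a_k1):
    """E_seg over frames nk <= n < nk1, modelling each frame as
    phi_k(n)*a_k + phi_{k+1}(n)*a_k1 (two scalar weights per frame)."""
    G = np.array([[a_k @ a_k,  a_k @ a_k1],
                  [a_k @ a_k1, a_k1 @ a_k1]])   # Gram matrix of targets
    e_seg, phis = 0.0, []
    for n in range(nk, nk1):
        y = Y[:, n]
        rhs = np.array([a_k @ y, a_k1 @ y])
        phi = np.linalg.solve(G, rhs)           # normal equations, 2 x 2
        phis.append(phi)
        e_seg += float(np.sum((y - phi[0] * a_k - phi[1] * a_k1) ** 2))
    return e_seg, np.array(phis)

rng = np.random.default_rng(2)
Y = rng.random((10, 20))                        # synthetic parameter block
e, phis = segment_error(Y, 3, 8, Y[:, 3], Y[:, 8])
# at n = n_k the frame equals a_k, so the weights are (1, 0) exactly
print(phis[0], e)
```

Because the frame at n_k coincides with the initial target a_k, its weight pair is (1, 0) and it contributes no error, which is consistent with event functions peaking at the event locations.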
The first segment, 1 ≤ n < n 1 , and the last segment, n K ≤ n < N, of a speech parameter block, 1 ≤ n ≤ N, should be specifically analyzed taking into account the fact that these two segments are described by only one event, that is, first and Kth events, respectively. This is achieved by introducing two dummy events located at n 0 = 0 and n K+1 = N + 1, with target vectors a 0 and a K+1 set to zero, in the process of evaluating E seg (1, n 1 ) and E seg (n K , N), respectively.

Optimization of event localization task
The previous subsection described the computational procedure for evaluating the optimum event functions, {φ_1(n), φ_2(n), ..., φ_K(n)}, and the corresponding accumulated modelling error for a block of speech parameters, E_block(n_1, n_2, ..., n_K), for a given candidate set of event locations, {n_1, n_2, ..., n_K}. The procedure relies on the initial approximation {y(n_1), y(n_2), ..., y(n_K)} for the event target set {a_1, a_2, ..., a_K}. Section 3.5 will describe a method of refining this initial approximation of the event target set to obtain an optimum result in terms of the speech parameter reconstruction accuracy of the TD model. With the above knowledge, the optimum event localizing task can be formulated as follows. Given a block of speech parameter frames, y(n), where 1 ≤ n ≤ N, and the number of events, K, allocated to the block (this determines the resolution, in event/s, of the TD analysis), we need to find the optimum locations of the events, {n*_1, n*_2, ..., n*_K}, such that E_block(n_1, n_2, ..., n_K) is minimized, where n_k ∈ {1, 2, ..., N} for 1 ≤ k ≤ K and n_1 < n_2 < ··· < n_K. The minimum accumulated error for a block can be given as

E*_block = min_{n_1 < n_2 < ··· < n_K} E_block(n_1, n_2, ..., n_K). (12)

It should be noted that E*_block versus K/N describes the rate-distortion performance of the TD model.

Dynamic programming formulation
A dynamic programming-based solution [14] for the optimum event localizing task can be formulated as follows. We define D(n_k) as the minimum accumulated error from the first frame of the parameter block up to the kth event location, n_k:

D(n_k) = min_{n_1 < ··· < n_{k−1} < n_k} Σ_{j=0}^{k−1} E_seg(n_j, n_{j+1}). (13)

Also note that

E*_block = D(n_{K+1}), (14)

where n_{K+1} = N + 1 is the dummy event closing the block. The minimum of the accumulated error, E*_block, can be calculated using the following recursive formula:

D(n_k) = min_{n_{k−1} ∈ R_{k−1}} [ D(n_{k−1}) + E_seg(n_{k−1}, n_k) ], (15)

for k = 1, 2, ..., K + 1, where D(n_0) = 0. The corresponding optimum event locations can be found by backtracking through

n*_{k−1}(n_k) = arg min_{n_{k−1} ∈ R_{k−1}} [ D(n_{k−1}) + E_seg(n_{k−1}, n_k) ], (16)

for k = 1, 2, ..., K + 1, where R_{k−1} is the search range for the (k−1)th event location, n_{k−1}. Figure 4 illustrates the dynamic programming formulation. For a full search assuring the global optimum, the search range R_{k−1} is the interval between n_{k−2} and n_k:

R_{k−1} = { n | n_{k−2} < n < n_k }. (17)

Starting with k = 1, the recursive formula in (15) gives D(n_1) = E_seg(n_0, n_1), so D(n_1) for all possible n_1 can be calculated. Substitution of k = 2 in (15) gives

D(n_2) = min_{n_1 ∈ R_1} [ D(n_1) + E_seg(n_1, n_2) ], (18)

where R_1 = {n | n_0 < n < n_2}. Using (18), D(n_2) can be calculated for all possible n_2. This procedure (the Viterbi algorithm [15]) can be repeated to obtain D(n_k) sequentially for k = 1, 2, ..., K + 1. The final step with k = K + 1 gives D(n_{K+1}) = E*_block and the corresponding optimal locations for n_1, n_2, ..., n_K (as given by (16)). Also, by decreasing the search range R_{k−1} in (17), a desired performance versus computational cost trade-off can be achieved for the event localizing task. However, the results reported in this paper are based on the full search range and thus guarantee the optimum event locations.
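The recursion above can be sketched as a small full-search dynamic program. This is an illustrative implementation under assumptions, not the paper's code: frames are 0-based, the leading and trailing dummy events use zero targets as described in the text, initial targets are the frames at the candidate locations, and the per-segment cost is computed by least squares.

```python
import numpy as np

def seg_cost(Y, a_left, a_right, lo, hi):
    """Least-squares error of frames lo <= n < hi modelled as a weighted
    sum of the two targets a_left, a_right (weights free per frame)."""
    if lo >= hi:
        return 0.0
    A = np.column_stack([a_left, a_right])
    Phi, *_ = np.linalg.lstsq(A, Y[:, lo:hi], rcond=None)
    return float(np.sum((Y[:, lo:hi] - A @ Phi) ** 2))

def locate_events(Y, K):
    """Full-search DP (Viterbi-style) for K event locations in 0..N-1."""
    I, N = Y.shape
    zero = np.zeros(I)                      # dummy-event target
    INF = np.inf
    # D[k, n]: best cost of frames [0, n) with the k-th event at frame n
    D = np.full((K + 1, N), INF)
    back = np.zeros((K + 1, N), dtype=int)
    for n in range(N):                      # leading dummy segment
        D[1, n] = seg_cost(Y, zero, Y[:, n], 0, n)
    for k in range(2, K + 1):
        for n in range(k - 1, N):
            for m in range(k - 2, n):       # full search range
                c = D[k - 1, m] + seg_cost(Y, Y[:, m], Y[:, n], m, n)
                if c < D[k, n]:
                    D[k, n], back[k, n] = c, m
    # close the block with the trailing dummy segment [n_K, N)
    tail = np.array([seg_cost(Y, Y[:, n], zero, n, N) for n in range(N)])
    total = D[K] + tail
    nK = int(np.argmin(total))
    locs = [nK]
    for k in range(K, 1, -1):               # backtrack optimal locations
        locs.append(int(back[k, locs[-1]]))
    return locs[::-1], float(total[nK])

rng = np.random.default_rng(3)
Y = rng.random((6, 14))                     # synthetic parameter block
locs, err = locate_events(Y, 3)
print(locs, err)
```

The triple loop makes the full search O(K·N²) segment-cost evaluations per block; shrinking the inner range corresponds to restricting R_{k−1} for the speed/optimality trade-off noted above.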

Refinement of event targets
The optimization procedure described in Sections 3.2 through 3.4 determines the optimum set of event functions, {φ_1(n), φ_2(n), ..., φ_K(n)}, and the optimum set of event locations, {n_1, n_2, ..., n_K}, based on the initial approximation {y(n_1), y(n_2), ..., y(n_K)} for the event target set, {a_1, a_2, ..., a_K}. We refine the initial set of event targets to further improve the modelling accuracy of the TD model. The event target vectors, a_k, can be refined by reevaluating them to minimize the reconstruction error for the speech parameters. This refinement process is based on the set of event functions determined in Section 3.4. Consider the modelling error, E_i, for the ith speech parameter trajectory within a block, given by

E_i = Σ_{n=1}^{N} ( y_i(n) − Σ_{k=1}^{K} a_ki φ_k(n) )², (19)

where y_i(n) and a_ki are the ith elements of the speech parameter vector, y(n), and the event target vector, a_k, respectively. The partial derivative of E_i with respect to a_ri can be calculated as

∂E_i/∂a_ri = −2 Σ_{n=1}^{N} ( y_i(n) − Σ_{k=1}^{K} a_ki φ_k(n) ) φ_r(n). (20)

Therefore, setting the above partial derivative to zero, we obtain

Σ_{k=1}^{K} a_ki ( Σ_{n=1}^{N} φ_k(n) φ_r(n) ) = Σ_{n=1}^{N} y_i(n) φ_r(n), (21)

where 1 ≤ r ≤ K and 1 ≤ i ≤ I. Equation (21) gives I sets of K simultaneous equations in K unknowns, which can be solved to determine the elements of the event target vectors, a_ki. This refined set of event targets can be used iteratively to further optimize the event functions and event locations using the dynamic programming formulation described in Section 3.4.
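In matrix form, the I systems above share the same K × K coefficient matrix Φ Φ^T, so all targets can be refined in one solve. The sketch below is illustrative (synthetic event functions and targets); in the noiseless case it recovers the true targets exactly:

```python
import numpy as np

# Event-target refinement: setting dE_i/da_ri = 0 for every i gives the
# normal equations (Phi Phi^T) A^T = Phi Y^T, one solve for all targets.
def refine_targets(Y, Phi):
    """Re-estimate targets A (I x K, a_k as columns) given Phi (K x N)."""
    G = Phi @ Phi.T                 # [G]_{rk} = sum_n phi_k(n) phi_r(n)
    B = Phi @ Y.T                   # [B]_{ri} = sum_n y_i(n) phi_r(n)
    return np.linalg.solve(G, B).T

rng = np.random.default_rng(4)
K, N, I = 4, 20, 10
Phi = rng.random((K, N)) + 0.1      # any full-rank event-function set
A_true = rng.random((I, K))
Y = A_true @ Phi                    # noiseless model output
A_hat = refine_targets(Y, Phi)
assert np.allclose(A_hat, A_true)   # exact recovery in the noiseless case
```

Alternating this solve with the event-function and event-location optimization implements the iterative refinement described above.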

Overlapping buffering technique
If no overlapping is allowed between adjacent blocks, the spectral error tends to be relatively high for the frames near the block boundaries. This is due to the fact that the first and last segments, 1 ≤ n ≤ n_1 and n_K ≤ n ≤ N, are described by only a single event target each instead of two, as described in Section 3.2.
The block overlapping technique effectively overcomes this problem by forcing each transmitted block to start and end at an event location. During analysis, the block length N is kept fixed. Overlapping is introduced so that the location of the first frame of the next block coincides with the location of the last event of the present block, as shown in Figure 5. This makes each transmitted block length slightly less than N, but their starting and end frames now coincide with an event location. Block length N determines the algorithmic delay introduced in analyzing continuous speech.

Speech data and performance measure
A speech data set consisting of 16 phonetically diverse sentences from the TIMIT speech database was used to evaluate the modelling performance of OTD. MELP [16] spectral parameters, that is, LSFs, calculated at 22.5-millisecond frame intervals, were used as the speech parameters for TD analysis.
The block size was set to N = 20 frames (450 milliseconds). The number of iterations was set to 5, as further iteration achieves only negligible (less than 0.01 dB) improvement in TD model accuracy. Spectral distortion (SD) [13] was used as the objective performance measure. The spectral distortion, D_n, for the nth frame is defined in dB as

D_n = sqrt( (1/2π) ∫_{−π}^{π} [ 10 log S_n(e^{jω}) − 10 log Ŝ_n(e^{jω}) ]² dω ), (22)

where S_n(e^{jω}) and Ŝ_n(e^{jω}) are the LPC power spectra corresponding to the original spectral parameters y(n) and the TD model (i.e., reconstructed) spectral parameters ŷ(n), respectively.
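The SD integral is typically approximated numerically on a dense frequency grid. The following is a minimal sketch, not the paper's implementation: it evaluates the LPC power spectra as 1/|A(e^{jω})|² (ignoring the gain term, which cancels only if the gains are equal) on a 512-point grid.

```python
import numpy as np

# Numerical spectral distortion (dB, RMS over frequency) between two
# LPC models a1, a2 given as polynomials [1, a_1, ..., a_p].
def spectral_distortion(a1, a2, nfft=512):
    A1 = np.fft.rfft(a1, nfft)          # A(e^jw) on a uniform grid
    A2 = np.fft.rfft(a2, nfft)
    p1 = -10.0 * np.log10(np.abs(A1) ** 2 + 1e-12)   # 10 log 1/|A|^2
    p2 = -10.0 * np.log10(np.abs(A2) ** 2 + 1e-12)
    return float(np.sqrt(np.mean((p1 - p2) ** 2)))

a = np.array([1.0, -0.9])               # one-pole LPC model
b = np.array([1.0, -0.5])
print(spectral_distortion(a, a))        # identical spectra -> 0 dB
print(spectral_distortion(a, b))
```

Averaging D_n over all frames of the test set gives the average SD used as the objective measure in the experiments above.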

Performance evaluation
One important feature of the OTD algorithm is its ability to freely select an arbitrary number of events per block, that is, the average number of events per second (event rate). This was not the case in previous TD algorithms [9,10,11], where the number of events was limited by constraints such as spectral stability. The average event rate, also called the TD resolution, determines the reconstruction error (distortion) of the TD model. The event rate, e_rate, can be given as

e_rate = (K/N) f_rate, (23)

where f_rate is the base frame rate of the speech parameters. Lower distortion can be expected for higher TD resolution and vice versa. However, higher resolution implies lower compression efficiency from an application point of view. This rate-distortion characteristic of the OTD algorithm is quite important for coding applications, and simulations were carried out to determine it. Average SD was evaluated for event rates of 4, 8, 12, 16, 20, and 24 event/s. Figure 6 shows an example of event functions obtained for a block of speech. Figure 7 shows the average SD versus event rate graph. The base frame rate point, that is, 44.44 frame/s, is also shown for reference. The significance of the frame rate is that if the event rate is made equal to the frame rate (in this case 44.44 event/s), the average SD should theoretically become zero. This is the maximum possible TD resolution and corresponds to a situation where all event functions become unit impulses spaced at frame intervals and the event target values exactly equal the original spectral parameter frames. As can be seen, an average event rate of more than 12 event/s is required if the OTD model is to achieve an SD of less than 1 dB. It should be noted that at this stage, the TD parameters are unquantized and, therefore, only modelling error accounts for the average SD.

Performance comparison with SBEL-TD
In the SBEL-TD algorithm [10], event localization is performed based on an a priori assumption of spectral stability and does not guarantee optimal event locations. Also, SBEL-TD incorporates an adaptive iterative technique to achieve the temporal nature (short duration of existence) of the event functions. In contrast, the OTD algorithm uses the modified model of TD (the temporal nature of the event functions is an inherent property of the model) and also uses the optimum locations for the events. In this section, the objective performance of the OTD algorithm is compared with that of the SBEL-TD algorithm [10] in terms of speech parameter modelling accuracy.
OTD analysis was performed on the speech data set described in Section 4.1, with the event rate set to 12 event/s (N = 20 and K = 5). SBEL-TD analysis was also performed on the same spectral parameter set with the event rate set approximately to 12 event/s (for a valid comparison between the two TD algorithms, the same event rate should be selected). Spectral parameter reconstruction accuracy was calculated using the SD measure for the two algorithms. Table 1 shows the average SD and the percentage of outlier frames for the two algorithms. As can be seen from the results in Table 1, the OTD algorithm achieved a significant improvement in terms of speech parameter modelling accuracy. Also, the percentage of outlier frames was reduced significantly in the OTD case. These improvements of the OTD algorithm are critically important for speech coding applications. As reported in [12], SBEL-TD fails to realize good-quality synthesized speech because the TD parameter quantization error increases the postquantized average SD and the number of outliers to unacceptable levels. With a significant improvement in speech parameter modelling accuracy, OTD has a greater margin to accommodate the TD parameter quantization error, resulting in good-quality synthesized speech in coding applications. Sections 5 and 6 give the details of the proposed OTD-based speech coding scheme and the coder performance evaluation, respectively.

Coder schematics
The mixed excitation LPC model [17] incorporated by the MELP coding standard [16] achieves good-quality synthesized speech at the bit rate of 2.4 kbit/s. The coder is based on a parametric model of speech operating at 22.5-millisecond speech frames. The MELP model parameters can be broadly categorized into the two groups of (1) excitation parameters that model the excitation, that is, LPC residual, to the LPC synthesis filter and consist of Fourier magnitudes, gain, pitch, bandpass voicing strengths, and aperiodic flag; (2) spectral parameters that represent the LPC filter coefficients and consist of the 10th-order LSFs.
With the above classification of MELP parameters, the MELP encoder can be represented as shown in Figure 8. The proposed OTD-based LPC vocoder uses the LPC excitation modelling and parameter quantization stages of the MELP coder, but uses block-based (i.e., delayed) OTD analysis and OTD parameter quantization for the spectral parameter encoding instead of the multistage vector quantization (MSVQ) [15] stage of the standard MELP coder. The proposed speech encoding scheme is shown in Figure 9. The underlying concept of the speech coder shown in Figure 9 is that it exploits the short-term redundancies (interframe and intraframe correlations) present in the spectral parameter frame sequence (line spectral frequencies), using TD modelling, for efficient encoding of spectral information at very low bit rates. The frame-based MSVQ stage of Figure 8 only accounts for the redundancies present within spectral frames (intraframe correlations), while the TD analysis and quantization stage of Figure 9 accounts for both interframe and intraframe redundancies present in the spectral parameter sequence, and is therefore capable of achieving significantly higher compression ratios. It should be noted that the concept of TD can also be used to exploit the short-term redundancies present in some of the LPC excitation parameters, using block-mode TD analysis. However, preliminary results of applying OTD to the LPC excitation parameters showed that the achievable coding gain is not significant compared to that for the LPC spectral parameters. Figure 10 gives the detailed schematic of the TD modelling and quantization stage shown in Figure 9. The first stage is to buffer the spectral parameter vector sequence using a block size of N = 20 (20 × 22.5 = 450 milliseconds). This introduces a 450-millisecond processing delay at the encoder. OTD is performed on the buffered block of spectral parameters to obtain the TD parameters (event targets and event functions).
The number of events calculated per block (N = 20) is set to K = 5, resulting in an average event rate of 12 event/s. The event target and event function quantization techniques are described in Section 5.2. The quantization code-book indices are transmitted to the speech decoder. Improved performance in terms of spectral parameter reconstruction accuracy can be achieved by coupling the TD analysis and TD parameter quantization stages as shown in Figure 10. The event targets from the TD analysis stage are refined using the quantized version of the event functions in order to optimize the overall performance of the TD analysis and TD parameter quantization stages.

Figure 10: Proposed spectral parameter encoding scheme based on the OTD. For improved performance, coupling between the TD analysis and the quantization stage is incorporated.

Event function quantization
One choice for quantization of the event function set, {φ_1, φ_2, ..., φ_K}, for each block is to use vector quantization (VQ) [15] on the individual event functions, φ_k, in order to exploit any dependencies in event function shapes. However, the event functions are of variable length (φ_k extending from n_{k−1} to n_{k+1}) and therefore require normalization to a fixed length before VQ. Investigations showed that the normalization-denormalization process itself introduces a considerable error, which adds to the quantization error. Therefore, we incorporated a frame-based 2-dimensional VQ for the event functions, which proved to be simple and effective. This was possible only because the modified TD model allows only two event functions to overlap at any frame location. The vectors [φ_k(n) φ_{k+1}(n)]^T were quantized individually. The distribution of the 2-dimensional vector points [φ_k(n) φ_{k+1}(n)]^T showed significant clustering, and this dependency was effectively exploited through the frame-level VQ of the event functions. Sixty-two phonetically diverse sentences from the TIMIT database, resulting in 8428 LSF frames, were used as the training set to generate code books of sizes 5, 6, 7, 8, and 9 bit using the LBG k-means algorithm [15].
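The frame-level 2-dimensional VQ can be sketched with a minimal LBG-style k-means trainer. This is an illustrative stand-in, not the paper's training setup: the clustered synthetic points below mimic the clustering reported for the (φ_k, φ_{k+1}) distribution, in place of the TIMIT-derived training vectors.

```python
import numpy as np

# Minimal LBG/k-means code-book training for 2-D event-function vectors
# [phi_k(n), phi_{k+1}(n)]; code-book size is 2^bits.
def train_codebook(X, bits, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    cb = X[rng.choice(len(X), 2 ** bits, replace=False)]   # init codewords
    for _ in range(iters):
        # nearest-codeword assignment for every training vector
        idx = np.argmin(((X[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
        for j in range(len(cb)):                           # centroid update
            if np.any(idx == j):
                cb[j] = X[idx == j].mean(axis=0)
    return cb

rng = np.random.default_rng(5)
# synthetic clustered 2-D points (assumed shape of the distribution)
X = np.concatenate([rng.normal(c, 0.05, (300, 2))
                    for c in [(1, 0), (0, 1), (0.5, 0.5)]])
cb = train_codebook(X, bits=5)                             # 32 codewords
idx = np.argmin(((X[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
mse = float(np.mean(((X - cb[idx]) ** 2).sum(-1)))
print(cb.shape, mse)
```

At run time each frame transmits a single n_1-bit index into this code book, which is what makes the per-frame event-function cost in the bit budget constant.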

Event target quantization
Quantization of the event target set, {a_1, a_2, ..., a_K}, for each block was performed by vector quantizing each target vector, a_k, separately. Event targets are 10-dimensional LSFs, but they differ from the original LSFs due to the iterative refinement of the event targets incorporated in the TD analysis stage. VQ code books of sizes 6, 7, 8, and 9 bit were generated with the LBG k-means algorithm [15] using the same training data set described in Section 5.2.1.

Objective quality evaluation
Spectral parameters can be synthesized from the quantized event targets, â_k, and quantized event functions, φ̂_k, for each speech block as

ŷ̂(n) = â_k φ̂_k(n) + â_{k+1} φ̂_{k+1}(n), n_k ≤ n < n_{k+1}, (24)

where ŷ̂(n) is the nth spectral parameter vector synthesized at the decoder using the quantized TD parameters. Note that double-hat notation is used here for the spectral parameters, as the single-hat notation is already used in (5) to denote the spectral parameters synthesized using the unquantized TD parameters. The average error between the original spectral parameters, y(n), and the synthesized spectral parameters, ŷ̂(n), calculated in terms of average SD (dB), was used to evaluate the objective quality of the coder. The final bit rate requirement for the spectral parameters of the proposed compression scheme can be expressed in bit per frame as

B = (N n_1 + K n_2 + K n_3) / N = n_1 + K (n_2 + n_3) / N, (25)

where n_1 and n_2 are the sizes (in bit) of the code books for the event function quantization and event target quantization, respectively. The parameter n_3 denotes the number of bits required to code each event location within a given block. For the chosen block size (N = 20) and the number of events per block (K = 5), the maximum possible segment length (n_{k+1} − n_k) is 16. Therefore, the event location information can be losslessly coded using differential encoding with n_3 = 4. The objective quality was evaluated on speech data distinct from the speech data set used for VQ code-book training in Section 5.2. The SD between the original spectral parameters and the reconstructed spectral parameters from the quantized TD parameters (given in (24)) was used as the objective performance measure. This SD was evaluated for different combinations of the event function and event target code-book sizes. The event location quantization resolution was fixed at n_3 = 4 bit.

Figure 11: Average SD against bit rate for the proposed speech coder with coupled TD analysis and TD parameter quantization stages. The code-book size for event target quantization, n_2, is depicted as (n_2).
Figure 11 shows the average SD (dB) for the different combinations of n_1 and n_2, plotted against the bit rate requirement B for spectral parameter encoding in bits/frame. The standard MELP coder uses 25 bits/frame for the spectral parameters (line spectral frequencies). In order to compare the rate-distortion performances of the proposed delay-domain speech coder and the standard MELP coder, the SD analysis was also performed for the standard MELP coder using the same speech data set. Table 2 shows the results of this analysis. For comparison, the SD analysis results obtained for the proposed coder with TD parameter quantization resolutions of n_1 = 7 and n_2 = 9 are also shown in Table 2.

Performance comparison
In comparison to the 25 bits/frame of the standard MELP coder, the proposed coder operating at n_1 = 7 and n_2 = 9 requires 10.25 bits/frame. This amounts to over 50% compression of the bit rate required for spectral information, at the cost of 0.4 dB of additional spectral distortion and 450 milliseconds of algorithmic coder delay.
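The quoted compression figure follows from simple arithmetic:

```python
melp_bits = 25.0       # standard MELP: bits/frame for the LSFs
proposed_bits = 10.25  # proposed coder at n1 = 7, n2 = 9

compression = 1.0 - proposed_bits / melp_bits   # -> 0.59, i.e. 59%
```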

Subjective quality evaluation
In order to corroborate the objective performance evaluation results, and to further verify the efficiency and applicability of the proposed speech coder design, subjective performance evaluation was carried out by means of listening tests. The 5-point degradation category rating (DCR) scale [18] was used to compare the subjective quality of the proposed coder to that of the standard MELP coder.

Experimental design
Six different operating bit rates of the proposed speech coder with coupling between the TD analysis and TD parameter quantization stages (Figure 10) were selected for subjective evaluation. Table 3 gives the six selected operating bit rates together with the corresponding quantization code-book sizes for the TD parameters and the objective quality evaluation results. It should be noted that the operating points in Table 3 offer the best rate-distortion trade-off within the grid of TD parameter quantizer resolutions (Figure 11) and were therefore selected for the subjective evaluation. Sixteen nonexpert listeners were recruited for the listening test on a volunteer basis. Each listener was asked to listen to 30 pairs of speech sentences (stimuli) and to rate the degradation in speech quality perceived when comparing the second stimulus in each pair to the first. In each pair, the first stimulus contained speech synthesized using the standard MELP coder and the second stimulus contained speech synthesized using the proposed speech coder. Each of the six operating bit rates in Table 3 was evaluated with 5 sentence pairs (including one null pair) per listener, giving a total of 30 (6 × 5) pairs of speech stimuli per listener. The null pairs, in which the first and second stimuli were identical speech samples, were included to monitor any bias in the one-sided DCR scale.

Results and analysis
The 30 pairs of speech stimuli, consisting of 5 pairs of sentences (including 1 null pair) for each of the 6 operating bit rates of the proposed speech coder, were presented to the 16 listeners. A total of 64 (16 × 4) votes (DCRs) was therefore obtained for each of the 6 operating bit rates, R_1 to R_6. Table 4 gives the DCRs obtained for each of the 6 operating bit rates. It should be noted that the degradation was measured relative to the subjective quality of the standard MELP coder. The degradation mean opinion score (DMOS) was calculated as the average of the DCR ratings (1-5), weighted by the number of votes each rating received. As can be seen from the DMOSs in Table 4, the proposed speech coder achieves a DMOS of over 4 for the operating bit rates R_1 to R_4, corresponding to compression ratios of 51% to 63%. The proposed speech coder therefore achieves over 50% compression of the bit rate required for spectral encoding with negligible degradation (between the "not perceivable" and "perceivable but not annoying" distortion levels) of the subjective quality of the synthesized speech. The DMOS drops below 4 for bit rates R_5 and R_6, suggesting that, on average, the degradation in subjective quality becomes perceivable and annoying for compression ratios over 63%.
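The DMOS computation amounts to a vote-count-weighted average of the ratings. A small sketch, using a hypothetical vote tally rather than the actual Table 4 data:

```python
def dmos(vote_counts):
    """Degradation MOS: average DCR rating (5 = degradation inaudible,
    1 = very annoying), weighted by the vote count for each rating."""
    total = sum(vote_counts.values())
    return sum(rating * count for rating, count in vote_counts.items()) / total

# hypothetical tally of the 64 votes for one operating bit rate
example_votes = {5: 32, 4: 24, 3: 8}
score = dmos(example_votes)   # -> 4.375
```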

CONCLUSIONS
We have proposed a dynamic programming-based optimization strategy for a modified TD model of speech. Optimal event localization, model accuracy control through the TD resolution, and an overlapping speech parameter buffering technique for continuous speech analysis are the main features of the proposed method. Improved objective performance in terms of modelling accuracy has been achieved compared to the SBEL-TD algorithm, in which event localization is based on an a priori assumption of spectral stability. A speech coding scheme based on the OTD algorithm and the associated VQ-based TD parameter quantization techniques was then proposed. The MELP model was used as the baseline parametric model of speech, with OTD incorporated for efficient compression of the spectral parameter information. Performance of the proposed speech coding scheme was evaluated in detail: objective performance in terms of log SD (dB), and subjective performance in terms of the DMOS calculated from DCR votes, with the DCR listening test conducted relative to the quality of standard MELP synthesized speech. These evaluations showed that the proposed speech coder achieves 50%-60% compression of the bit rate required for spectral parameter encoding with only slight degradation (between the "not perceivable" and "perceivable but not annoying" distortion levels) of the subjective quality of the decoded speech. The proposed speech coder would find useful applications in voice store-forward messaging systems, multimedia voice-output systems, and broadcasting.