Channel Effect Compensation in LSF Domain

This study addresses the problem of channel effect in the line spectrum frequency (LSF) domain. LSF parameters are popular speech features encoded in the bit stream for low bit-rate speech transmission. A method of channel effect compensation in the LSF domain is therefore of interest for robust speech recognition on mobile communication and Internet systems. If the bit error rate in the transmission of digitally encoded speech is negligibly low, the channel distortion comes mainly from the microphone or the handset. When the speech signal is represented in terms of the phase of the inverse filter derived from LP analysis, this channel distortion can be expressed in terms of the channel phase. Further derivation shows that mean subtraction performed on the phase of the inverse filter can minimize the channel effect. Based on this finding, an iterative algorithm is proposed to remove the bias on LSFs due to the channel effect. Experiments on simulated channel distorted speech and on real telephone speech are conducted to show the effectiveness of the proposed method. Its performance is comparable to that of cepstral mean normalization (CMN) using cepstral coefficients.


INTRODUCTION
Channel distortion is always a serious problem in speech recognition systems and may drastically degrade recognition performance [1,2,3]. The channel effect in the cepstral domain has been extensively studied, and many approaches have been proposed for eliminating the influence of channel distortion on speech recognition performance [4,5,6,7,8,9]. However, few studies aim at the channel effect in the line spectrum frequency (LSF) domain. LSFs are usually the parameters used for low bit-rate speech transmission (e.g., ITU-T G.723.1, G.728, G.729, TIA IS-96, IS-127, . . .). A speech or speaker recognition algorithm based on LSFs is of interest in mobile communication and Internet systems [10,11,12,13,14]. Although the LSF parameters show poor performance in a large vocabulary continuous speech recognition (LVCSR) system, they achieve performance comparable to cepstral coefficients in connected digits recognition or small vocabulary speech recognition systems [12,13]. Since the LSF parameters can be extracted directly from the bit stream of encoded speech, they are very promising features for speech recognition in some simple applications.
The codec process is another factor that influences speech quality [15]. Since the encoded speech parameters are the only available information we can use, it is hard to compensate for this nonlinear channel effect. If the bit error rate in the transmission of encoded speech is negligibly low, the channel distortion comes mainly from the microphone or the handset. In this study, we deal with only the linear channel distortion due to transducers. However, the effect of the codec process on recognition performance will be evaluated for comparison.
LSFs are alternative representations of linear prediction coefficients (LPCs) and have been extensively used in speech coding and synthesis [16,17,18,19]. Using LSFs directly extracted from the encoded bit stream for speech recognition is preferred since it avoids decoding the encoded speech into a waveform [10,13,14]. Some studies have reported that features obtained in this way are more robust in adverse environments than those derived from the decoded speech waveform [10,20].
In this study, we formulate the speech signal in terms of inverse filter derived from linear prediction (LP) analysis. When the speech signal is represented by the phase of inverse filter, the channel distortion can be expressed in terms of the channel phase [21]. Further derivation shows that the mean subtraction performed on the phase of inverse filter can minimize the channel effect. Based on this finding, an iterative algorithm is proposed to remove the bias on LSFs due to channel effect.
Two series of experiments are conducted herein. The first series of experiments uses simulated channel distorted speech to examine the channel effect in a digital communication system due to handset distortion and the effect of the codec process. The second series of experiments is performed on real telephone speech to demonstrate the effectiveness of the proposed method. The experimental results show that the performance degradation caused by the codec process is worse than that caused by handset distortion; the combination of the codec process and handset distortion yields the worst performance. Nevertheless, the proposed method yields significant improvements in speech recognition performance.
This paper is organized as follows. Section 2 briefly reviews the fundamentals of LSFs. Section 3 describes the channel effect on the phase of the inverse filter and in the LSF domain. Section 4 introduces mean normalization on the phases of the inverse filters to minimize the channel effect; an iterative algorithm is then derived for removing the bias on LSFs due to the channel effect. Section 5 presents experimental results showing the effectiveness of the proposed methods. Section 6 concludes the paper.

Linear prediction
In LP analysis, the speech production is modeled as a discrete-time equation

$$x(n) = \sum_{k=1}^{M} a(k)\,x(n-k) + G\,e(n), \tag{1}$$

where a(1), a(2), . . . , a(M) are the LPCs, M is the system order, e(n) is the excitation source, and G is the gain of the excitation. Equation (1) in the z-domain is

$$A(z)\,X(z) = G\,E(z), \tag{2}$$

where

$$A(z) = 1 - \sum_{k=1}^{M} a(k)\,z^{-k} \tag{3}$$

is the inverse filter, and X(z) and E(z) are the signal and the excitation, respectively. G/A(z) is called the LP model and is often used to characterize the spectral envelope of a speech signal.
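As an illustrative sketch (not code from the paper), the LPCs $a(k)$ and gain $G$ of Eq. (1) can be estimated with the autocorrelation method and the Levinson-Durbin recursion; the function name and the toy AR(2) example below are our own assumptions:

```python
import numpy as np

def lp_analysis(x, M):
    """Estimate LPCs a(1..M) and gain G of the LP model (Eq. (1)) from a
    signal frame using the autocorrelation method and Levinson-Durbin."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(M + 1)])
    a = np.zeros(M + 1)          # a[k] corresponds to a(k); a[0] unused
    E = r[0]                     # prediction error energy
    for i in range(1, M + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E   # reflection coeff.
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        E *= 1.0 - k * k
    return a[1:], np.sqrt(E)     # LPCs a(1)..a(M) and gain G

# Toy AR(2) source: x(n) = 0.8 x(n-1) - 0.2 x(n-2) + e(n)
rng = np.random.default_rng(0)
e = rng.standard_normal(8000)
x = np.zeros(8000)
for n in range(2, 8000):
    x[n] = 0.8 * x[n - 1] - 0.2 * x[n - 2] + e[n]
lpc, G = lp_analysis(x, 2)
print(lpc)   # close to [0.8, -0.2]
```

The recovered coefficients approach the generating AR parameters as the frame grows, which is the sense in which G/A(z) models the spectral envelope.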

Line spectrum frequencies
LSFs can be obtained from the LP model by defining a symmetrical polynomial P(z) and an antisymmetrical polynomial Q(z) in terms of the inverse filter A(z):

$$P(z) = A(z) + z^{-(M+1)}A(z^{-1}), \qquad Q(z) = A(z) - z^{-(M+1)}A(z^{-1}). \tag{4}$$

The zeros of P(z) and Q(z) are on the unit circle and are interlaced. These zeros occur in complex conjugate pairs, and their angles are the LSFs. LSFs can also be computed by formulating a ratio filter as

$$R(z) = \frac{z^{-(M+1)}A(z^{-1})}{A(z)}. \tag{5}$$

In radian frequency, the phase of the ratio filter is given by

$$R(e^{j\omega}) = e^{-j\phi(\omega)}, \qquad \phi(\omega) = (M+1)\,\omega + 2\,\theta(\omega), \tag{6}$$

where $\phi(\omega)$ and $\theta(\omega)$ represent the phase of the ratio filter $R(e^{j\omega})$ and the phase of the inverse filter $A(e^{j\omega})$, respectively. The LSFs are the frequencies at which the phase of the ratio filter is equal to a multiple of $\pi$ radians; that is,

$$\phi(\omega_k) = k\pi, \qquad k = 1, 2, \ldots, M. \tag{7}$$

Therefore, (6) provides another approach for calculating LSFs. In this study, (6) and (7) serve as the basis to investigate the channel effect.
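The root-finding route to LSFs via P(z) and Q(z) can be sketched as follows; the helper `lpc_to_lsf` and the toy LPCs are illustrative assumptions, not code from the paper:

```python
import numpy as np

def lpc_to_lsf(lpc):
    """LSFs (radians, ascending) from LPCs a(1..M) by rooting the
    symmetric and antisymmetric polynomials P(z) and Q(z)."""
    M = len(lpc)
    A = np.concatenate(([1.0], -np.asarray(lpc, dtype=float)))  # A(z) coeffs
    Arev = np.concatenate(([0.0], A[::-1]))   # z^{-(M+1)} A(z^{-1}) coeffs
    Aext = np.concatenate((A, [0.0]))
    P = Aext + Arev                           # symmetric polynomial
    Q = Aext - Arev                           # antisymmetric polynomial
    angles = []
    for poly in (P, Q):
        # coefficients in powers of z^{-1} double as highest-degree-first
        ang = np.angle(np.roots(poly))
        angles.extend(a for a in ang if 1e-6 < a < np.pi - 1e-6)
    return np.sort(np.array(angles))          # trivial zeros at z = +/-1 dropped

lsf = lpc_to_lsf([0.8, -0.2])
print(lsf)   # two interlaced LSFs in (0, pi): ~[0.6435, 1.5708]
```

For these LPCs, P(z) contributes the zero pair at $0.8 \pm 0.6j$ (angle $\approx 0.6435$) and Q(z) the pair at $\pm j$ (angle $\pi/2$), illustrating the interlacing property.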

Channel effect on the phase of ratio filter
For a speech signal x(n), the channel distorted signal is expressed as y(n) = x(n) * h(n) in the time domain, where h(n) is the impulse response of the channel H(z). By expressing the speech signal and the distorted signal in terms of inverse filters, we obtain the following relation:

$$\frac{G_y}{A_y(z)} = \frac{G_x}{A_x(z)}\,H(z), \tag{8}$$

where $A_y(z)$ and $A_x(z)$ are the inverse filters of the channel distorted speech y(n) and the original speech x(n), respectively; $G_y$ and $G_x$ are the gains in the LP analysis of y(n) and x(n), respectively. In radian frequency, the phase of the inverse filter $A_y(e^{j\omega})$ is expressed by

$$\theta_y(\omega) = \theta_x(\omega) - \theta_h(\omega), \tag{9}$$

where $\theta_x(\omega)$ and $\theta_h(\omega)$ are the phases of $A_x(e^{j\omega})$ and $H(e^{j\omega})$, respectively. By the definition of (6), the phase of the ratio filter for y(n) is expressed as

$$\phi_y(\omega) = \phi_x(\omega) - 2\theta_h(\omega), \tag{10}$$

where $\phi_x(\omega)$ is the phase of the ratio filter for x(n). This equation indicates that the channel effect causes a bias to the phase of the ratio filter. Figure 1 shows an example of the channel effect on the power spectrum and the phase of the ratio filter.
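As a toy numerical check (our own construction, not the paper's experiment), choose an all-pole channel $H(z) = 1/(1 - b\,z^{-1})$ so that $A_y(z) = A_x(z)(1 - b\,z^{-1})$ holds exactly; the phase relation of Eq. (9) can then be verified on a frequency grid:

```python
import numpy as np

def phase_on_grid(coeffs, w):
    """Unwrapped phase of a polynomial in z^{-1} evaluated at z = e^{jw}."""
    return np.unwrap(np.angle(np.polyval(coeffs[::-1], np.exp(-1j * w))))

w = np.linspace(0.01, np.pi - 0.01, 512)
Ax = np.array([1.0, -0.8, 0.2])       # toy inverse filter of the clean speech
b = 0.5                               # channel H(z) = 1 / (1 - b z^{-1})
Ay = np.convolve(Ax, [1.0, -b])       # exact: A_y(z) = A_x(z) (1 - b z^{-1})

theta_x = phase_on_grid(Ax, w)
theta_y = phase_on_grid(Ay, w)
theta_h = -phase_on_grid(np.array([1.0, -b]), w)   # phase of H(e^{jw})

# Eq. (9): the channel shifts the inverse-filter phase by -theta_h(w)
print(np.max(np.abs(theta_y - (theta_x - theta_h))))   # ~ 0
```

In practice $A_y(z)$ comes from LP analysis of the distorted speech and the relation holds only approximately, but the toy construction isolates the phase bias itself.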

Channel effect on LSFs
Starting from the channel effect on the phase of the ratio filter, we derive the channel effect on LSFs. First, consider the curve of the phase of the ratio filter for y(n), $\phi_y(\omega)$. The mean slope of the curve between $\omega_k^x$ and $\omega_k^y$ is defined by

$$s_y(\omega_k^x, \omega_k^y) = \frac{\phi_y(\omega_k^y) - \phi_y(\omega_k^x)}{\omega_k^y - \omega_k^x}, \tag{11}$$

where $\omega_k^x$ and $\omega_k^y$ are the kth LSFs for x(n) and y(n), respectively (see Figure 2). According to (7), we find that

$$\phi_y(\omega_k^y) = k\pi = \phi_x(\omega_k^x). \tag{12}$$

Substituting (12) into (11) and applying the relationship of (10), we rewrite (11) as

$$s_y(\omega_k^x, \omega_k^y) = \frac{\phi_x(\omega_k^x) - \phi_x(\omega_k^x) + 2\theta_h(\omega_k^x)}{\omega_k^y - \omega_k^x} = \frac{2\theta_h(\omega_k^x)}{\omega_k^y - \omega_k^x}. \tag{13}$$

Rearranging (13), we get

$$\omega_k^y = \omega_k^x + \frac{2\theta_h(\omega_k^x)}{s_y(\omega_k^x, \omega_k^y)}. \tag{14}$$

The above equation states that the channel effect on LSFs is a bias expressed in terms of the slope and the channel phase.

COMPENSATION OF CHANNEL EFFECT
Equation (14) indicates that the bias on LSFs resulting from the channel effect can be compensated if the slope $s_y(\omega_k^x, \omega_k^y)$ and the channel phase $\theta_h(\omega_k^x)$ are available. However, the channel phase is hard to estimate.
We assume that the channel effect is stationary in an utterance. Taking the average of (9) over the whole utterance, we obtain

$$\frac{1}{L}\sum_{m=1}^{L}\theta_{y,m}(\omega) = \frac{1}{L}\sum_{m=1}^{L}\theta_{x,m}(\omega) - \theta_h(\omega), \tag{15}$$

where m is the frame index and L is the number of frames in an utterance. If we subtract this mean from the phase of the inverse filter for y(n) in each frame, it comes out that

$$\tilde{\theta}_{y,m}(\omega) = \theta_{y,m}(\omega) - \frac{1}{L}\sum_{l=1}^{L}\theta_{y,l}(\omega) = \theta_{x,m}(\omega) - \frac{1}{L}\sum_{l=1}^{L}\theta_{x,l}(\omega). \tag{16}$$

The result is exactly the mean subtracted phase of the inverse filter for x(n). This implies that mean subtraction on the phase of the inverse filter eliminates the channel phase. By using the mean subtracted phase of the inverse filter to find the LSFs, the channel effect on LSFs is minimized. Hence we formulate the following equation to solve for the LSFs:

$$\tilde{\phi}_{y,m}(\omega) = (M+1)\,\omega + 2\,\tilde{\theta}_{y,m}(\omega). \tag{17}$$

The resulting LSFs are the frequencies that satisfy

$$\tilde{\phi}_{y,m}\big(\tilde{\omega}_{k,m}^{y}\big) = k\pi. \tag{18}$$
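The utterance-level mean subtraction can be sketched numerically (the toy inverse filters and channel below are our assumptions): simulate L frames, apply the same channel phase to each, and confirm that mean subtraction removes it exactly:

```python
import numpy as np

def phase_on_grid(coeffs, w):
    """Unwrapped phase of a polynomial in z^{-1} evaluated at z = e^{jw}."""
    return np.unwrap(np.angle(np.polyval(coeffs[::-1], np.exp(-1j * w))))

w = np.linspace(0.01, np.pi - 0.01, 256)
rng = np.random.default_rng(1)

# L frames of toy inverse filters, each with a random conjugate pair of
# LP-model poles (i.e., zeros of A) inside the unit circle
L = 20
theta_x = []
for _ in range(L):
    r, ang = rng.uniform(0.3, 0.9), rng.uniform(0.2, 2.9)
    theta_x.append(phase_on_grid(np.array([1.0, -2 * r * np.cos(ang), r * r]), w))
theta_x = np.array(theta_x)

theta_h = -phase_on_grid(np.array([1.0, -0.5]), w)  # channel H(z) = 1/(1 - 0.5 z^{-1})
theta_y = theta_x - theta_h                          # Eq. (9), frame by frame

# Per-utterance mean subtraction cancels the channel phase
tilde_y = theta_y - theta_y.mean(axis=0)
tilde_x = theta_x - theta_x.mean(axis=0)
print(np.max(np.abs(tilde_y - tilde_x)))   # ~ 0
```

The mean-subtracted phases of the distorted frames coincide with those of the clean frames, which is exactly the property the compensation algorithm exploits.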
The following shows how to obtain the compensated LSFs $\{\tilde{\omega}_{k,m}^{y}\}$. In order to solve $\tilde{\phi}_{y,m}(\tilde{\omega}_{k,m}^{y}) = k\pi$ for $\tilde{\omega}_{k,m}^{y}$, an iterative scheme based on the Newton-Raphson method [22] is applied. First we define the quantity

$$g(\omega) = \tilde{\phi}_{y,m}(\omega) - k\pi, \tag{19}$$

whose root is the desired LSF $\tilde{\omega}_{k,m}^{y}$. Starting from the distorted LSF $\omega_{k,m}^{y}$, the estimate is updated iteratively by

$$\omega^{(i+1)} = \omega^{(i)} - \eta\,\frac{g(\omega^{(i)})}{g'(\omega^{(i)})}, \tag{20}$$

where $\eta$ is a scalar factor for adjusting the step size and $g'(\omega)$ is the derivative of $g(\omega)$, approximated by the finite difference

$$g'(\omega) \approx \frac{g\big(\omega + \delta\,\mathrm{sgn}(g(\omega))\big) - g(\omega)}{\delta\,\mathrm{sgn}(g(\omega))}, \tag{21}$$

where $\delta$ is a small value and sgn(·) is the sign function.
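A minimal sketch of the Newton-Raphson search follows, here applied to a clean toy inverse filter so the answer can be checked against the root-finding construction of P(z) and Q(z); the function names, step size, and finite-difference form are illustrative assumptions (a plain central difference is used in place of the paper's exact scheme):

```python
import numpy as np

A = np.array([1.0, -0.8, 0.2])   # toy inverse filter, M = 2
M = len(A) - 1

def phi(w):
    """Ratio-filter phase, Eq. (6): phi(w) = (M+1) w + 2 theta(w).
    For this low-order A, theta stays inside (-pi, pi), so no unwrapping."""
    theta = np.angle(np.polyval(A[::-1], np.exp(-1j * w)))
    return (M + 1) * w + 2 * theta

def solve_lsf(k, w0, eta=1.0, delta=1e-4, iters=10):
    """Newton-Raphson search for the k-th LSF: root of g(w) = phi(w) - k*pi,
    with g'(w) approximated by a central finite difference."""
    w = w0
    for _ in range(iters):
        g = phi(w) - k * np.pi
        dg = (phi(w + delta) - phi(w - delta)) / (2 * delta)
        w -= eta * g / dg
    return w

# phi rises monotonically from 0 to (M+1)*pi, so seed the k-th search at k*pi/(M+1)
lsf = np.array([solve_lsf(k, k * np.pi / (M + 1)) for k in (1, 2)])
print(lsf)   # ~ [0.6435, 1.5708], matching the root-finding construction
```

Because the ratio filter is all-pass, phi is monotone in omega and the iteration converges in a handful of steps, consistent with the convergence behavior reported in the experiments.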

EXPERIMENTS
Two series of experiments are conducted herein. The first series of experiments uses simulated channel distorted speech to examine the channel effect due to handset distortion and also the effect of the codec process. The second series of experiments is performed on real telephone speech.

Experiment 1
The TI digits database is used in this series of experiments to examine the channel effect due to handset distortion and also the effect of the codec process. Figure 3 shows the characteristics of the 41 handsets used in the experiments. The codec process is the algorithm of ITU-T G.723.1. The channel distorted speech is simulated as follows to evaluate the channel effect.
(1) In the case of handset distortion, the speech signal is convolved with a randomly selected handset before feature extraction is performed. LSFs are calculated for 50% overlapping frames, based on LP analysis.
(2) In the case of the codec process, the speech signal is fed into the CELP encoder and the LSFs are extracted directly from the encoded bit stream. Since the encoder performs without frame overlapping, the number of extracted frames is inconsistent with that of the overlapped frames in comparison. An interpolated frame is inserted between each pair of consecutive frames to overcome this inconsistency. Linear interpolation is applied to determine the average feature vector from each pair of features.
(3) In the case of the combination of handset distortion and codec process, the speech signal is first convoluted with a randomly chosen handset and then fed into the CELP encoder to generate an encoded bit stream.
At first, the learning behavior of the proposed iterative algorithm is investigated. Figure 4a displays the learning behavior with various scalar factors. The distortion is measured by the average distance between LSFs before and after channel compensation. The resulting curves show that the iterative scheme converges quickly within the first two iterations. For the case of η = 1, the relationship between the iteration number and the recognition performance on simulated channel distorted speech is illustrated in Figure 4b. It shows that satisfactory performance can be achieved within two iterations, which is very promising for real-time applications. Table 1 displays the performance of using LSFs in speech recognition with speech models trained on clean speech. It shows that the three kinds of distortion substantially degrade the performance. The performance degradations are about 14%, 29%, and 45% for the cases of handset distortion, codec process, and the combination of handset distortion and codec process, respectively. It is obvious that the performance degradation caused by the codec process is much worse than that caused by handset distortion. The combination of handset distortion and the codec process results in the worst performance. Significant improvement is obtained when the proposed channel effect compensation method is applied to the case of handset distortion. However, the performance is less improved for speech distorted by the codec process or the combination of handset distortion and codec process.
For comparison, the performance of using LPCCs derived from LSFs is evaluated and listed in Table 2. Comparing Table 1 with Table 2, we find that the proposed channel effect compensation method gives performance comparable to that of the CMN method using LPCCs. Inconsistency in feature extraction substantially degrades the performance for both LSFs and LPCCs.
Tables 1 and 2 also show that the codec process causes unacceptable performance. This poor performance is due to the mismatches generated by the nonlinear operation of the codec process and the inconsistent feature extraction. The proposed channel effect compensation method cannot effectively compensate for these mismatches. Hence, the speech models are retrained using speech features directly extracted from the encoded bit stream. Since both training and testing data are processed by the same codec algorithm, these retrained models give much better performance. Similarly, we also retrain the models using LPCCs for comparison. Tables 3 and 4, respectively, show the performance of using LSFs and LPCCs with speech models trained on encoded speech. The results indicate that the performance obtained using encoded speech models is substantially enhanced. Although handset distortion still significantly degrades the performance in this case, the proposed channel compensation method can effectively recover the performance. The performance is close to that of using LPCCs with speech models trained on encoded speech.

Experiment 2
The subdatabase MATDB-2 of the Mandarin Across Taiwan (MAT) database is used in this series of experiments. Table 5 displays the recognition results of using LSFs and LPCCs. It shows that when cepstral mean subtraction and the proposed channel effect compensation method are not performed, the recognition performance is about 91% for both LPCCs and LSFs. When they are performed, the performance is enhanced to about 92.5%. The results suggest that the proposed channel effect compensation method is effective and its performance is comparable to that of the CMN method using LPCCs.

CONCLUSIONS
This work focuses on the compensation of channel effect in the LSF domain. When a speech signal is represented in terms of the phase of the inverse filter derived from LP analysis, the channel distortion can be expressed in terms of the channel phase. Further derivation shows that mean subtraction performed on the phase of the inverse filter can minimize the channel effect. Based on this finding, an iterative algorithm is proposed to compensate for the channel effect. To demonstrate the effectiveness of the proposed methods, two series of experiments on simulated channel distorted speech and real telephone speech are conducted. The experimental results show that the proposed method yields significant improvements in both situations. Its performance is comparable to that of CMN using cepstral coefficients.