The multiple linear regression model-based LS method for IPD estimation is proposed in the expanded phase domain, **Ω**_{d}. The proposed method consists of two stages: the multiple linear regression model-based LS-TDE in the first stage, and an RLS-based source tracking method that uses the delay estimated in the first stage. After constructing an LS cost function for the TDE method based on the multiple linear regression model, it is verified that the proposed LS method is an ideal estimator that is unconstrained by phase wrapping. In the second stage, an RLS-TDE method is proposed that handles both fixed and moving source tracking. The proposed RLS method can be implemented by a simple equation, and it is also appropriate for conversational speech. Finally, a novel two-channel weighting method for noisy and reverberant environments is described.

### 4.1. First stage: multiple linear regression model-based TDE

In Section 3.2, the multiple linear regression model, which comprises three linear lines in a 6*π* interval, is explained in detail. The proposed LS criterion using the multiple linear regression model is given as

{\widehat{\tau}}_{E,d}=\text{arg}\underset{\tau}{\text{min}}\sum _{m=-1}^{1}\sum _{l}|\left({\omega}_{l}\tau +2m\pi -{\xi}_{E,d}\left({\omega}_{l}\right)\right){|}^{2},

(8)

where *d* = -1, 0, 1 is the expanded domain index, *l* = 0, 1, ..., 4*K* - 1 is the interpolated frequency index, and *ξ*_{E,d}(*ω*_{l}) ∈ **Ω**_{d} is the expanded observation phase for each case in Figure 3. Then, the LS solution is derived by setting the derivative with respect to *τ* to zero:

0=6\sum _{l}\left({\omega}_{l}^{2}\tau -{\omega}_{l}{\xi}_{E,d}\left({\omega}_{l}\right)\right)+4\pi \sum _{m=-1}^{1}m\sum _{l}{\omega}_{l}.

(9)

The second term in Equation 9, which corresponds to phase shifting, is equal to zero because \sum _{m=-1}^{1}m=0. Therefore, the proposed multiple linear regression model-based LS-TDE in the expanded phase domain is equivalent to the conventional LS equation given in Equation 4. Finally, the proposed LS solution is easily calculated by adopting a vector notation, {\widehat{\tau}}_{E,d}={\left({\bar{\omega}}^{H}\bar{\omega}\right)}^{-1}{\bar{\omega}}^{H}{\bar{\xi}}_{E,d}, where \bar{\omega} and {\bar{\xi}}_{E,d} are *L*_{d} × 1 vectors and *L*_{d} is the number of discrete frequencies satisfying *ξ*_{E,d}(*ω*_{l}) ∈ **Ω**_{d}. A weighted solution, which does not affect the above derivation, is given as

{\widehat{\tau}}_{E,d}={\left({\bar{\omega}}^{H}\Psi \bar{\omega}\right)}^{-1}{\bar{\omega}}^{H}\Psi {\bar{\xi}}_{E,d},

(10)

where **Ψ** is a diagonal matrix whose entries are the reciprocals of the IPD error variance, which is related to the SNR of the input signal. The variance of the IPD error at an interpolated frequency is the same as the original variance. The proposed solution in the expanded phase domain, Equation 10, is not only unconstrained by phase wrapping but also corresponds to the ideal LS solution of Equation 4. Furthermore, Equation 10 becomes an MVU estimator since the Gaussian assumption for the IPD error, Equation 7, is valid in the expanded phase domain. Finally, the estimator selects the most accurate delay among the estimates obtained in the expanded phase domains by measuring the Euclidean distance between the estimated and the observed phases as follows:

{\widehat{\tau}}_{LS}=\text{arg}\underset{d}{\text{min}}{\left(\sum _{l}{\left({\omega}_{l}{\widehat{\tau}}_{E,d}-{\xi}_{E,d}\left({\omega}_{l}\right)\right)}^{2}\right)}^{1/2},\phantom{\rule{1em}{0ex}}d=-1,0,1.

(11)
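As an illustration, Equations 10 and 11 can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `weighted_ls_delay` and `select_delay` are hypothetical, the phases are treated as real-valued so that the Hermitian transpose reduces to an ordinary transpose, and the expanded observation phases and weights of each domain are assumed to have been computed beforehand as described in Section 3.2.

```python
import numpy as np


def weighted_ls_delay(omega, xi, psi=None):
    """Weighted LS slope of a line through the origin (Equation 10)."""
    omega = np.asarray(omega, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # psi holds the diagonal of the weighting matrix; None means unweighted LS.
    psi = np.ones_like(omega) if psi is None else np.asarray(psi, dtype=float)
    return np.sum(psi * omega * xi) / np.sum(psi * omega * omega)


def select_delay(domains):
    """Return (tau, d) with the smallest Euclidean phase residual (Equation 11).

    `domains` maps each d in {-1, 0, 1} to a tuple (omega_d, xi_d, psi_d)
    (hypothetical container chosen for this sketch).
    """
    best = None
    for d, (omega, xi, psi) in domains.items():
        tau = weighted_ls_delay(omega, xi, psi)
        residual = np.linalg.norm(np.asarray(omega) * tau - np.asarray(xi))
        if best is None or residual < best[0]:
            best = (residual, tau, d)
    return best[1], best[2]
```

Because the regression line passes through the origin, the matrix inverse in Equation 10 reduces to a scalar division, which is why no linear solver is needed here.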

### 4.2. Second stage: RLS for moving speaker tracking

Generally, a single-frame LS-TDE easily runs into a lack-of-data problem because the frame length for analyzing a speech signal is only 20-30 ms and the sampling frequency is limited by the capacity of typical electronic devices. As more data become available, the performance of TDE approaches the ideal lower bound, such as the Cramér-Rao bound (CRB) [30, 32]. To use multiple frames for TDE, however, the non-stationarity of the speech signal and the moving-source case should be considered. This article proposes an RLS-TDE method which improves the performance of TDE while accounting for an arbitrarily moving speaker. First, the LS-TDE result, {\widehat{\tau}}_{LS}, of the first stage is used to select the frequencies for the RLS processing as follows:

\left\{{\omega}_{l}\;\middle|\;\left|{\omega}_{l}{\widehat{\tau}}_{LS}-{\xi}_{E}\left({\omega}_{l}\right)\right|<\pi \right\},\phantom{\rule{1em}{0ex}}l=0,1,\dots ,L-1.

(12)

Using the criterion given in Equation 12, the frequencies whose phases lie within a 2*π* interval around the straight line f\left({\omega}_{l}\right)={\omega}_{l}{\widehat{\tau}}_{LS} are selected as candidates for the second stage. Three new quantities are defined to simplify the equations: {\bar{\omega}}_{r}\left(n\right) is the vector of frequencies satisfying Equation 12 at the *n*th frame, and {\bar{\xi}}_{r}\left(n\right) and {\Psi}_{r}\left(n\right) are the corresponding phase vector and diagonal weighting matrix, respectively. Then, the RLS criterion is given as

J=\sum _{q=0}^{Q}{\delta}^{q}\left(\sum _{m=-1}^{1}{\bar{A}}^{T}\left(m,n-q\right){\Psi}_{r}\left(n-q\right)\bar{A}\left(m,n-q\right)\right),

(13)

where *T* denotes the vector transpose, *δ* is a positive constant less than one, and *Q* is the maximum number of observation frames. The criterion vector, \bar{A}\left(m,n\right), and the all-ones vector, \bar{I}, are defined as

\bar{A}\left(m,n\right)=\left({\bar{\omega}}_{r}\left(n\right)\tau +2\pi m\bar{I}-{\bar{\xi}}_{r}\left(n\right)\right),\phantom{\rule{1em}{0ex}}\bar{I}={\left[1,\dots ,1\right]}^{T}.

(14)

Finally, the RLS-TDE is represented by

{\widehat{\tau}}_{\mathsf{\text{RLS}}}\left(n\right)=\frac{\sum _{q=0}^{Q}{\delta}^{q}\left({\bar{\omega}}_{r}^{T}\left(n-q\right){\Psi}_{r}\left(n-q\right){\bar{\xi}}_{r}\left(n-q\right)\right)}{\sum _{q=0}^{Q}{\delta}^{q}\left({\bar{\omega}}_{r}^{T}\left(n-q\right){\Psi}_{r}\left(n-q\right){\bar{\omega}}_{r}\left(n-q\right)\right)}.

(15)

Equation 15 is the same as Equation 10 except for the term *δ*^{q}, which exponentially decreases the contribution of past data. In addition, all of the RLS vectors are re-initialized when a long silence interval occurs in the observation data. Experimental results described in detail later confirm that the performance of RLS-TDE is superior to conventional methods even for a fast-moving speech source.
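The second stage can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the class name `RlsTde` and the parameters `delta` and `max_frames` are hypothetical, and silence detection is assumed to be done by the caller, who then calls `reset()`. Each frame contributes only two scalars to Equation 15, so the sketch keeps those scalars for the most recent *Q* + 1 frames and evaluates the two weighted sums directly.

```python
from collections import deque

import numpy as np


class RlsTde:
    """Exponentially weighted multi-frame delay estimate (Equation 15)."""

    def __init__(self, delta=0.9, max_frames=20):
        self.delta = delta  # forgetting constant, 0 < delta < 1
        # Holds (omega^T Psi xi, omega^T Psi omega) for frames q = 0..Q.
        self.history = deque(maxlen=max_frames + 1)

    def reset(self):
        """Discard past frames, e.g. after a long silence interval."""
        self.history.clear()

    def update(self, omega, xi, psi, tau_ls):
        omega, xi, psi = (np.asarray(a, dtype=float) for a in (omega, xi, psi))
        # Equation 12: keep frequencies whose phase lies within pi of the line.
        keep = np.abs(omega * tau_ls - xi) < np.pi
        num = np.sum(psi[keep] * omega[keep] * xi[keep])
        den = np.sum(psi[keep] * omega[keep] ** 2)
        # Newest frame goes to index 0, so the index equals the frame age q.
        self.history.appendleft((num, den))
        weights = self.delta ** np.arange(len(self.history))
        nums, dens = np.array(list(self.history)).T
        total_den = np.dot(weights, dens)
        # Fall back to the first-stage estimate if no frequency was selected.
        return np.dot(weights, nums) / total_den if total_den > 0 else tau_ls
```

When *Q* is allowed to grow without bound, the same estimate could instead be maintained with two running accumulators that are multiplied by `delta` at every frame; the bounded history is used here only because it mirrors the finite sums of Equation 15.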

### 4.3. Weighting for LS-TDE in noisy and reverberant condition

In Section 3.1, it is shown that the IPD error distribution can be regarded as Gaussian with variance (2 × SNR)^{-1}. This property is implied in the ML TDE of Knapp's method [9], in which the ML weighting is derived from the MSC. Note that the MSC can be regarded as an SNR of the input signal. In practice, the MSC must be estimated from the observed data using a temporal averaging method [33]. However, it is hard to estimate the MSC accurately for non-stationary data such as speech. The proposed method adopts an approximated-ML weighting which is roughly equivalent to the SNR evaluated from a single frame as follows [12, 22, 23]:

\psi \left({\omega}_{k}\right)=\frac{\left|{X}_{1}\left({\omega}_{k}\right)\right|\left|{X}_{2}\left({\omega}_{k}\right)\right|}{{\left|{N}_{1}\left({\omega}_{k}\right)\right|}^{2}{\left|{X}_{2}\left({\omega}_{k}\right)\right|}^{2}+{\left|{N}_{2}\left({\omega}_{k}\right)\right|}^{2}{\left|{X}_{1}\left({\omega}_{k}\right)\right|}^{2}}.

(16)
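The weighting of Equation 16 can be evaluated per frequency bin as in the following sketch. The function name `approx_ml_weight` is hypothetical, the noise power spectra |*N*_{1}|^{2} and |*N*_{2}|^{2} are assumed to be supplied by a separate noise estimator that is not described here, and the small constant `eps` is an added safeguard against division by zero that is not part of Equation 16.

```python
import numpy as np


def approx_ml_weight(X1, X2, noise_psd1, noise_psd2, eps=1e-12):
    """Approximated-ML weight per frequency bin (Equation 16).

    X1, X2: complex spectra of the two channels for one frame.
    noise_psd1, noise_psd2: assumed estimates of |N1|^2 and |N2|^2.
    """
    mag1 = np.abs(np.asarray(X1))
    mag2 = np.abs(np.asarray(X2))
    denom = np.asarray(noise_psd1) * mag2**2 + np.asarray(noise_psd2) * mag1**2
    return (mag1 * mag2) / (denom + eps)
```

The returned array can be used as the diagonal of **Ψ** in Equation 10.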

The proposed LS-TDE in the expanded phase domain given in Equation 10, with the weighting function of Equation 16, satisfies all the ML estimation conditions, e.g., the Gaussian assumption of the IPD error and weighting by the reciprocal of its variance. The weighting given in Equation 16 is useful when the coherence between the noises at the two sensors and the target speech signal is negligible. However, it cannot distinguish speech from other signals in a reverberant environment. Piersol [20] examined the spatial coherence between two sensors and demonstrated its effect on TDE through extensive experimental results, which are consistent with the theoretical analysis. To design a practical two-channel system for reverberant environments, an alternative method that suppresses the reverberation effect by signal-to-reverberation ratio (SRR)-based weighting is introduced.

To estimate the power of the direct signal and the reverberant components, a two-channel generalized side-lobe canceller (GSC) structure is adopted [34]. Figure 5 shows a simplified block diagram for estimating the direct signal power. In this method, the power envelopes of the delay-and-sum beamformer (DSB) output, *Q*(*ω, n*), and of the delay-and-subtract output used as a reference signal, *U*(*ω, n*), are obtained using the first-order recursive equations:

\begin{array}{ll}\hfill {\lambda}_{q}\left(\omega ,n\right)& =\eta {\lambda}_{q}\left(\omega ,n-1\right)+\left(1-\eta \right)|Q\left(\omega ,n\right){|}^{2},\phantom{\rule{2em}{0ex}}\\ \hfill {\lambda}_{u}\left(\omega ,n\right)& =\eta {\lambda}_{u}\left(\omega ,n-1\right)+\left(1-\eta \right)|U\left(\omega ,n\right){|}^{2},\phantom{\rule{2em}{0ex}}\end{array}

(17)

where *n* is the frame index and *η* is a forgetting factor set close to, but less than, one. Then, the energy of the reverberant residual components, {\widehat{\lambda}}_{r}\left(\omega ,n\right), is obtained as follows:

{\widehat{\lambda}}_{r}\left(\omega ,n\right)=W\left(\omega ,n\right){\lambda}_{u}\left(\omega ,n\right),

(18)

where *W*(*ω, n*) is a frequency-dependent gain that is adaptively updated using a quadratic cost function, *J*_{w} = {*λ*_{e}(*ω, n*)}^{2}, where the error, *λ*_{e}(*ω, n*), is equal to {\lambda}_{q}\left(\omega ,n\right)-{\widehat{\lambda}}_{r}\left(\omega ,n\right). Finally, the direct signal power is estimated using a spectral-subtraction method [35]:

|{\widehat{S}}_{d}\left(\omega ,n\right){|}^{2}=|Q\left(\omega ,n\right){|}^{2}-{\widehat{\lambda}}_{r}\left(\omega ,n\right).

(19)

In Habets's de-reverberation method [34], a post filter is applied to the DSB output, *Q*(*ω, n*). However, the spectral subtraction method given in Equation 19 is sufficient in our application because only the power envelope of the direct signal component is needed. Finally, the SRR is represented as follows (omitting the frame index, as in Equation 16):

\psi \left({\omega}_{k}\right)=\frac{|{\widehat{S}}_{d}\left({\omega}_{k}\right){|}^{2}}{{\widehat{\lambda}}_{r}\left({\omega}_{k}\right)}.

(20)
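The chain from Equation 17 to Equation 20 can be sketched per frame as below. Several parts are assumptions made only for illustration: the class name `SrrWeight`, the step size `mu`, and the constant `eps` are hypothetical; the text states only that *W*(*ω, n*) is adapted using the quadratic cost *J*_{w}, so the normalized gradient step used here is one possible choice and not necessarily the update of [34]; and the flooring of the gain and of the direct power at zero is an added safeguard.

```python
import numpy as np


class SrrWeight:
    """Per-bin SRR weighting from DSB and delay-and-subtract outputs."""

    def __init__(self, n_bins, eta=0.9, mu=0.1, eps=1e-12):
        self.eta, self.mu, self.eps = eta, mu, eps
        self.lam_q = np.zeros(n_bins)  # power envelope of Q(omega, n)
        self.lam_u = np.zeros(n_bins)  # power envelope of U(omega, n)
        self.gain = np.zeros(n_bins)   # W(omega, n)

    def update(self, Q, U):
        p_q = np.abs(np.asarray(Q)) ** 2
        p_u = np.abs(np.asarray(U)) ** 2
        # Equation 17: first-order recursive smoothing.
        self.lam_q = self.eta * self.lam_q + (1.0 - self.eta) * p_q
        self.lam_u = self.eta * self.lam_u + (1.0 - self.eta) * p_u
        # Error of the current reverberant estimate, lam_e = lam_q - W * lam_u.
        err = self.lam_q - self.gain * self.lam_u
        # Assumed normalized gradient step on J_w = lam_e^2, gain kept >= 0.
        step = self.mu * err * self.lam_u / (self.lam_u**2 + self.eps)
        self.gain = np.maximum(self.gain + step, 0.0)
        lam_r = self.gain * self.lam_u            # Equation 18
        direct = np.maximum(p_q - lam_r, 0.0)     # Equation 19, floored at 0
        return direct / (lam_r + self.eps)        # Equation 20
```

The array returned by `update` plays the role of *ψ*(*ω*_{k}) and can replace the weighting of Equation 16 on the diagonal of **Ψ**.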

The proposed method suppresses the late reverberation well but has no impact on the early reflected component, which is the principal cause of bias in the IPD distribution. The bias caused by early reflections depends entirely on physical conditions, including the shape of the room and the sensor and source positions. Dealing with early reflections blindly remains a challenging research area.