Split SR-RLS for the Joint Initialization of the Per-Tone Equalizers and Per-Tone Echo Cancelers in DMT-Based Receivers

In asymmetric digital subscriber lines (ADSL), the available bandwidth is divided in subcarriers or tones which are assigned to the upstream and/or downstream transmission direction. To allow efficient bidirectional communication over one twisted pair, echo cancellation is required to separate upstream and downstream channels. In addition, intersymbol interference and intercarrier interference have to be reduced by means of equalization. In this paper, a computationally efficient algorithm for adaptively initializing the per-tone equalizers (PTEQ) and per-tone echo cancelers (PTEC) is presented. For a given number of equalizer and echo canceler taps per-tone, it was shown that the joint PTEQ/PTEC receiver structure is able to maximize the signal-to-noise ratio (SNR) on each subcarrier and hence also the achievable bit rate. The proposed initialization scheme is based on a modification of the square root recursive least squares (SR-RLS) algorithm to reduce computational complexity and memory requirement compared to full SR-RLS, while keeping the convergence rate acceptably fast. Our performance analysis will show that the proposed method converges in the mean and an upper bound for the step size is given. Moreover, we will indicate how the presented initialization method can be reused in several other ADSL applications.


INTRODUCTION
ADSL stands for asymmetric digital subscriber lines and is able to provide broadband data transmission over the existing telephone network. To increase the spectral efficiency of the available bandwidth, ADSL employs a transmission technique based on multicarrier modulation, namely, discrete multitone (DMT) [1,2]. DMT divides the available bandwidth into $N$ parallel subchannels or tones, by means of an $N$-point inverse fast Fourier transform (IFFT). At the transmitter, each tone is modulated by quadrature amplitude modulation (QAM) and IFFT transformed to obtain a time domain signal. At the receiver, an $N$-point FFT can be used for demodulation. Prepending each data block after IFFT modulation with a cyclic prefix ensures that the subchannels remain independent after transmission over a channel. If the order of the channel (modeled as an FIR filter) is smaller than the cyclic prefix length, $\nu$, the transmitted signal can easily be recovered by a bank of complex scalars, the so-called frequency domain equalizers (FEQs).
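The role of the cyclic prefix can be illustrated with a small numerical sketch. The block below is not taken from the paper: the sizes `N = 64` and `nu = 8`, the random channel `h`, and the complex baseband signal (a real ADSL modem transmits a real-valued signal) are illustrative assumptions. It only shows that, when the channel is shorter than the prefix, one complex scalar per tone recovers the transmitted QAM symbols.

```python
import numpy as np

rng = np.random.default_rng(0)
N, nu = 64, 8                       # assumed DFT size and cyclic prefix length

# One DMT block of 4-QAM symbols, one symbol per tone.
X = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)

x = np.fft.ifft(X)                  # IFFT modulation
tx = np.concatenate([x[-nu:], x])   # prepend the cyclic prefix

h = rng.standard_normal(nu)         # FIR channel of order nu - 1 (shorter than the prefix)
rx = np.convolve(tx, h)             # noiseless transmission over the channel

y = rx[nu:nu + N]                   # discard the prefix, keep N samples
Y = np.fft.fft(y)                   # FFT demodulation
feq = 1.0 / np.fft.fft(h, N)        # one-tap frequency domain equalizer per tone
X_hat = feq * Y

max_err = np.max(np.abs(X_hat - X))
```

Making `h` longer than `nu + 1` taps in this sketch breaks the circular structure, which is the situation that the TEQ and PTEQ structures discussed next are meant to handle.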
In the ADSL context, the channel impulse response typically exceeds the cyclic prefix length, thereby destroying subchannel orthogonality. As a result, intersymbol interference (ISI) and intercarrier interference (ICI) will be present and a channel-shortening time domain equalizer (TEQ) is required [3,4,5,6,7]. An alternative equalization structure is based on "per-tone" equalization (PTEQ), which accomplishes the joint task of TEQ/FEQ independently for each tone [8,9].
Besides equalization, echo cancellation is required to separate upstream and downstream signals and to enable efficient bidirectional communication over the same telephone wire. Echo occurs due to signal leakage from the transmit side to the receive side in the modem since both sides are imperfectly coupled to the telephone line. If properly designed, echo cancellation can improve the reach and/or noise margin of an ADSL system by allowing both upstream and downstream signals to share the low frequency portion of the available frequency band.
Several echo cancellation structures for DMT transceivers have been studied in literature [6,8,10,11,12,13]. All the proposed structures exploit a common principle, namely, the echo channel is estimated through an adaptive updating process and an emulated version of the echo is subtracted from the received signal. Unfortunately, the echo cancelers studied in [10,11,12] are designed independently from the equalizer. Van Acker et al. presented a joint per-tone echo cancellation (PTEC) and PTEQ, where an echo canceler and equalizer have to be designed for each tone separately [13]. For a given number of equalizer and echo canceler taps per subcarrier, this approach is able to optimize the signal-to-noise ratio (SNR) on each subcarrier and hence maximizes the achievable bit rate [13].
In this paper, we will focus on adaptively initializing the PTEQ/PTEC receiver structure. The problem consists of solving several parallel minimum mean square error (MMSE) problems (one MMSE problem for each tone) in an adaptive way. We are especially interested in developing an adaptive algorithm which exhibits fast convergence, low memory requirement, and low computational complexity.
In the literature, several adaptive algorithms exist to solve an MMSE problem of the form

$$\min_{\mathbf{w}} E\left\{\left|d^{(k)} - \mathbf{w}^T\mathbf{u}^{(k)}\right|^2\right\}, \tag{1}$$

where $E\{\cdot\}$ represents the expectation operator, $\{\cdot\}^T$ denotes the transpose, $d^{(k)}$ is some desired signal at time $k$, $\mathbf{w}$ are the unknown coefficients, and $\mathbf{u}^{(k)}$ is the input vector. The most well-known and extensively studied adaptive algorithm is certainly the least mean square (LMS) algorithm by Widrow and Hoff [14,15]. Although the algorithm is simple, the bad conditioning of the input autocorrelation matrices (one for each tone) for the PTEQ/PTEC receiver leads to slow convergence. Since the seventies, a lot of effort has been spent to find alternatives for LMS with faster convergence, which has led to a variety of algorithms.
(i) LMS derivatives: these algorithms are derived from the original LMS scheme and include algorithms such as normalized LMS (NLMS) [14] and looping LMS (LLMS) [16]. In NLMS, the step size is normalized with the input signal power to avoid gradient noise amplification [14], which leads to slightly improved convergence. LLMS repeatedly applies LMS to a block of data, but still requires too many iterations and computations in case of the PTEQ/PTEC receiver.

(ii) Transform domain LMS: this type of adaptive filters refers to LMS filters where blocks of input data are preprocessed with a (unitary) data-independent transformation [17,18]. The main purpose of this preprocessing step is to improve the eigenvalue distribution of the input autocorrelation matrix and hence to accelerate convergence. The choice of this transformation largely depends on the underlying problem. Time series filtering applications, where $\mathbf{u}^{(k)}$ is drawn from a tapped delay line, typically use the discrete Fourier transform (DFT), to obtain the so-called frequency domain LMS algorithm. However, the PTEQ/PTEC receiver is in fact a "linear combiner" problem, where no shift structure in $\mathbf{u}^{(k)}$ is available. Hence, an optimal transformation is not straightforward to obtain.

(iii) Square root recursive least squares (SR-RLS): in general, the SR-RLS algorithm does not impose any restrictions on the input data structure $\mathbf{u}^{(k)}$. SR-RLS exhibits fast convergence, but adds computational complexity compared to the LMS derivatives. Since the order of complexity increases with the square of the number of parameters in $\mathbf{w}$, complexity reductions are desired. To mitigate the high computational burden of RLS, the family of fast RLS algorithms such as fast transversal filters (FTF) [19] and QR-decomposition based lattice filters (QRD-LSL) have been proposed. Unfortunately, the complexity reductions attained in these algorithms rely again on the signal shift nature of the filtering problem. Hence, these fast schemes are not suitable for our problem in particular.

(iv) Split RLS: this algorithm approximates the RLS algorithm with several lower-dimensional RLS problems and is able to obtain a complexity which is linear in the number of parameters [20]. Although this method does not require any specific data structure, only the estimation error is computed without finding $\mathbf{w}$ directly. Moreover, the authors of [20] do not prove the convergence of the obtained algorithm and indicate that a high level of misadjustment is possible for highly correlated input signals.
The contributions of this paper can be summarized as follows. First, we will derive a general method for adaptively computing $\mathbf{w}$ of (1) without relying on any specific data structure in $\mathbf{u}^{(k)}$. Whereas the split RLS algorithm of [20] only computes the estimation error, $d^{(k)} - \mathbf{w}^T\mathbf{u}^{(k)}$, the proposed method "merges" the SR-RLS and the split RLS algorithms to find the tap weight vector $\mathbf{w}$ explicitly. The resulting structure will be referred to as split SR-RLS. As opposed to [20], we will provide a general proof of convergence. The proof will indicate that the step size of the proposed adaptation process can always be chosen in such a way that convergence in the mean is achieved. In addition, an upper bound for the step size will be derived.
The second contribution of this paper is the application of the proposed split SR-RLS method to the PTEQ/PTEC initialization problem. Due to the specific nature of the PTEQ/PTEC input elements, we will illustrate how a lower complexity and lower memory requirement can be achieved compared to full SR-RLS. Although the rate of convergence will be slower than full SR-RLS, the presented algorithm will converge much faster than NLMS. We will also indicate briefly the applicability of the proposed split SR-RLS method to other ADSL initialization problems.
The paper is organized as follows. In Section 2, the data model and the notation for standard adaptive algorithms are introduced. Section 3 describes the split SR-RLS algorithm, which is applied to initialize the PTEQ/PTEC in Section 4. Finally, simulation results are presented in Section 5, followed by the conclusions in Section 6.

Notation
Throughout this paper the following notation will be used: (i) time domain vectors and matrices are indicated by bold face lower case and upper case letters, respectively; (ii) $\{\cdot\}^T$, $\{\cdot\}^H$, $\{\cdot\}^*$ denote transpose, complex conjugate transpose, and complex conjugate, respectively; (iii) $\mathbf{w}$ is the unknown, complex-valued tap weight vector with $T$ parameters, while $\mathbf{u}^{(k)}$ is used to indicate a complex-valued input signal vector at time $k$; (iv) $\mathbf{X}_{uu}$ and $\mathbf{X}_{ku}$ denote autocorrelation and crosscorrelation matrices, respectively (defined in (5) and (13)).

Problem formulation
Given the input data vectors $\mathbf{u}^{(k)}$ at time instant $k$, the goal is to find the $T$ unknown weight coefficients such that the filter output, $\mathbf{w}^T\mathbf{u}^{(k)}$, is as close as possible to some desired signal $d^{(k)}$ in mean square sense, cf. (1). Here, every variable can be complex-valued and no specific structure on the input data is assumed. In general, $\mathbf{w}$ just forms a linear combination of the input elements and is henceforth referred to as a linear combiner. In the following subsections, we will discuss NLMS and SR-RLS to find the optimal MMSE solution of (1) in an adaptive way.

Least mean square
The (normalized) LMS algorithm was designed as a stochastic gradient descent method to solve (1) [14]. It approximates the MMSE solution by continuously updating the weight vector $\mathbf{w}$ as new data vectors are received, according to

$$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \frac{\mu}{\alpha + \mathbf{u}^{(k+1)H}\mathbf{u}^{(k+1)}}\,\mathbf{u}^{(k+1)*}\,e^{(k)},$$

where $e^{(k)} = d^{(k+1)} - \mathbf{w}^{(k)T}\mathbf{u}^{(k+1)}$, $\mu$ represents the step size that governs the convergence rate, and $\alpha$ prevents overflow for signals with low energy. This algorithm is computationally simple, but a large eigenvalue spread of the input correlation matrix often leads to a convergence rate which is unacceptably slow.
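The NLMS recursion can be sketched in a few lines. The block below is a minimal illustration, assuming the linear combiner convention of the text (output $\mathbf{w}^T\mathbf{u}$, no conjugate on $\mathbf{w}$); the function name `nlms`, the synthetic white input, and the parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def nlms(U, d, mu=0.5, alpha=1e-6):
    """NLMS for a linear combiner with output w^T u (rows of U are input vectors)."""
    n_samples, T = U.shape
    w = np.zeros(T, dtype=complex)
    for k in range(n_samples):
        u = U[k]
        e = d[k] - w @ u                         # a priori estimation error
        power = np.vdot(u, u).real               # u^H u, the input energy
        w = w + mu * np.conj(u) * e / (alpha + power)
    return w

# Synthetic, noiseless test problem (assumed for illustration only).
rng = np.random.default_rng(1)
T, n_samples = 4, 2000
U = rng.standard_normal((n_samples, T)) + 1j * rng.standard_normal((n_samples, T))
w_true = rng.standard_normal(T) + 1j * rng.standard_normal(T)
d = U @ w_true

w_hat = nlms(U, d)
nlms_error = np.linalg.norm(w_hat - w_true)
```

With white inputs as above the recursion converges quickly; the slow convergence mentioned in the text appears once the columns of `U` are strongly correlated.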

Square root recursive least square
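To overcome the slow convergence of LMS, (1) can be approximated by a least squares (LS) problem

$$\min_{\mathbf{w}} \left\|\mathbf{d}^{(k)} - \mathbf{U}^{(k)}\mathbf{w}\right\|_2^2, \tag{6}$$

where $\mathbf{d}^{(k)}$ is a vector of $k + 1$ training or desired symbols and $\mathbf{U}^{(k)}$ contains a set of $k + 1$ input signal vectors, stacked as rows. Given that $\mathbf{U}^{(k)H}\mathbf{U}^{(k)}$ is full rank, the LS solution of (6) is given by

$$\mathbf{w}^{(k)} = \left(\mathbf{U}^{(k)H}\mathbf{U}^{(k)}\right)^{-1}\mathbf{U}^{(k)H}\mathbf{d}^{(k)}. \tag{9}$$

With [21], we can rewrite (9) as a recursive update of $\mathbf{w}^{(k)}$ along the Kalman gain vector $\mathbf{k}^{(k)}$. The SR-RLS algorithm is based on iteratively updating the lower triangular matrix $\mathbf{S}^{(k)} = \mathbf{R}^{(k)-T}$ by means of unitary Givens or Jacobi rotations [14]. The matrix $\mathbf{R}^{(k)}$ is the (upper triangular) Cholesky factor of the sample covariance matrix $\mathbf{U}^{(k)H}\mathbf{U}^{(k)}$. Often, an exponential weighting factor $0 < \lambda < 1$ is included to ensure that data in the distant past is forgotten in order to track statistical variations of the input data in a nonstationary environment. Correspondingly, the exponentially weighted sample covariance matrix approximates $\mathbf{X}_{uu}/(1 - \lambda^2)$, where $1/(1 - \lambda^2)$ represents in fact the memory of the system. This approximation only holds for large $k$ and $\lambda$ close to unity.

Algorithm 1: The SR-RLS algorithm.
Initialize filter coefficients $\mathbf{w}^{(0)}$ and $\mathbf{S}^{(0)}$. For $k = 0, \ldots, \infty$:
(1) form the matrix-vector product $\mathbf{a}$;
(2) for $m = 0, \ldots, T - 1$, determine the Givens rotations [14] $\mathbf{Q}_m$, where each $\mathbf{Q}_m$ zeroes out the $(m + 1)$st element of $\mathbf{a}$;
(3) update $\mathbf{S}^{(k)}$ and determine the Kalman gain vector, $\mathbf{k}^{(k+1)}$, using the previously obtained $\mathbf{Q}_m$, $m = 0, \ldots, T - 1$.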
As mentioned before, LMS convergence is dictated by the eigenvalue spread of the input correlation matrix $\mathbf{X}_{uu}$. SR-RLS is able to "get rid" of the eigenvalue spread by using an iterative update based on a transformed update direction which is called the Kalman gain vector. An efficient realization of updating $\mathbf{S}^{(k)}$ and $\mathbf{w}^{(k)}$ is described in Algorithm 1 [22]. Similar to LMS (cf. (5)), the convergence of SR-RLS is determined by the crosscorrelation matrix of $\mathbf{k}^{(k)}$ and $\mathbf{u}^{(k)}$:

$$\mathbf{X}_{ku} = E\left\{\mathbf{k}^{(k)}\mathbf{u}^{(k)T}\right\}. \tag{13}$$

Based on (11), (12), and (13), we observe that all eigenvalues of $\mathbf{X}_{ku}$ are (approximately) equal. Hence, the Kalman gain update direction removes the eigenvalue spread and thereby improves the convergence speed. This improvement in performance, however, is achieved at the expense of a large increase in computational complexity and memory requirement. Whereas the complexity of NLMS is on the order of $O(T)$, the complexity and memory requirement of SR-RLS is $O(T^2)$.
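To make the role of the Kalman gain concrete, the following sketch implements exponentially weighted RLS with an explicit correlation matrix. It is deliberately not the square-root form of Algorithm 1: no Givens rotations are used and the matrix is solved directly, which is simpler to read but numerically and computationally less attractive. The function name `rls`, the regularization `delta`, and the use of $\lambda^2$ as the per-sample weight (chosen to match the memory $1/(1-\lambda^2)$ quoted in the text) are assumptions.

```python
import numpy as np

def rls(U, d, lam=1.0, delta=1e-3):
    """Conventional (non-square-root) RLS for a linear combiner with output w^T u."""
    n_samples, T = U.shape
    Phi = delta * np.eye(T, dtype=complex)       # regularized correlation matrix
    w = np.zeros(T, dtype=complex)
    for k in range(n_samples):
        u = U[k]
        Phi = lam**2 * Phi + np.outer(np.conj(u), u)   # weighted sum of u^* u^T
        gain = np.linalg.solve(Phi, np.conj(u))        # Kalman gain vector
        e = d[k] - w @ u                               # a priori error
        w = w + gain * e
    return w

# Synthetic, noiseless test problem (assumed for illustration only).
rng = np.random.default_rng(2)
T, n_samples = 4, 500
U = rng.standard_normal((n_samples, T)) + 1j * rng.standard_normal((n_samples, T))
w_true = rng.standard_normal(T) + 1j * rng.standard_normal(T)
d = U @ w_true

w_rls = rls(U, d)                                   # lam = 1: no forgetting
w_batch = np.linalg.lstsq(U, d, rcond=None)[0]      # batch LS solution
rls_error = np.linalg.norm(w_rls - w_true)
batch_gap = np.linalg.norm(w_rls - w_batch)
```

Without forgetting, the recursion reproduces the batch LS solution up to the small bias introduced by `delta`. Each iteration costs $O(T^3)$ here because of the explicit solve; the square-root update of Algorithm 1 brings this down to the $O(T^2)$ quoted above.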

SPLIT SR-RLS WITH REDUCED COMPLEXITY
To alleviate the computational burden of a full-blown SR-RLS, the input elements of the "linear combiner" application under consideration could be divided into smaller groups, cf. the split RLS algorithm in [20]. Unlike [20], our goal is to compute $\mathbf{w}^{(k)}$ instead of $e^{(k)}$ only. As we will motivate in the next section, for the PTEQ/PTEC receiver we are mainly interested in dividing the input vector into two (unequal) parts. The ultimate goal is to design a modified SR-RLS scheme maintaining a fast convergence rate but with lower computational complexity and lower memory requirement.
To achieve this goal, we will merge the split RLS and SR-RLS algorithm into a split SR-RLS algorithm. Assume we split the input vector $\mathbf{u}^{(k)}$ into two parts of length $T_1$ and $T_2$, respectively, such that $T_1 + T_2 = T$ (a reordering of the inputs might be possible), that is,

$$\mathbf{u}^{(k)} = \begin{bmatrix}\mathbf{u}_1^{(k)} \\ \mathbf{u}_2^{(k)}\end{bmatrix}.$$

Now, we design a separate SR-RLS problem for each set of inputs. This requires two lower triangular matrices $\mathbf{S}_1^{(k)}$ and $\mathbf{S}_2^{(k)}$ (of size $T_1 \times T_1$ and $T_2 \times T_2$, respectively) to be updated, see Algorithm 2. The update direction is now determined by $\mathbf{l}^{(k+1)}$, which consists of a concatenation of two Kalman gain vectors, one for each input set. Similar to (12), we can write

$$\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \mu\,\mathbf{l}^{(k+1)}e^{(k)}.$$

Notice that a step size $\mu$ has been added to ensure convergence. In Appendix A, we show that the convergence of the proposed scheme is determined by the maximum eigenvalue of the crosscorrelation matrix between $\mathbf{l}^{(k)}$ and $\mathbf{u}^{(k)}$:

$$\mathbf{X}_{lu} = E\left\{\mathbf{l}^{(k)}\mathbf{u}^{(k)T}\right\}.$$

Additionally, in Appendix B it is shown that $\mathbf{X}_{lu}$ has eigenvalues $1 - \lambda^2$ with multiplicity $T_1 - T_2$ and $2T_2$ eigenvalues equal to $(1 - \lambda^2)(1 \pm d_i)$, with the $d_i$'s equal to the cosines squared of the principal angles between the subspaces $\mathcal{S}_1$ and $\mathcal{S}_2$ spanned by the columns of $\mathbf{U}_1^{(k)}$ and $\mathbf{U}_2^{(k)}$, where $\mathbf{U}_1^{(k)}$ and $\mathbf{U}_2^{(k)}$ are matrices containing the first $T_1$ and the last $T_2$ columns of $\mathbf{U}^{(k)}$, respectively. Apparently, the modified update direction is able to partially remove the eigenvalue spread and will therefore lead to a convergence speed in between SR-RLS and NLMS. In Appendix B, it is also shown that convergence in the mean is achieved when $\mu$ stays below an upper bound set by the maximum eigenvalue of $\mathbf{X}_{lu}$. Since the convergence rate depends on the eigenvalue spread of $\mathbf{X}_{lu}$, convergence will be faster when all eigenvalues tend to be equal, that is, when the cosines of the principal angles between $\mathcal{S}_1$ and $\mathcal{S}_2$ go to zero. Hence, the convergence rate will be faster whenever $\mathcal{S}_1$ and $\mathcal{S}_2$ are more orthogonal. The proposed algorithm is straightforwardly obtained but can attain substantial complexity improvements and memory reductions, as illustrated in the following section. Similar to [20], the algorithm could be extended to more than two distinct parts, leading to higher misadjustment and slower convergence. In this case, an upper bound for the step size cannot easily be derived. In the limit, we obtain an LMS-like update, where each input element is weighted with the averaged energy of that element.

Algorithm 2: The split SR-RLS algorithm.
Initialize filter coefficients $\mathbf{w}^{(0)}$ and $\mathbf{S}_1^{(0)}$, $\mathbf{S}_2^{(0)}$. For $k = 0, \ldots, \infty$:
(1) form the matrix-vector products $\mathbf{a}_1$ and $\mathbf{a}_2$;
(2) for $m = 0, \ldots, T - 1$, determine the Givens rotations [14] $\mathbf{Q}_m$, where $\mathbf{Q}_m$ zeroes out the elements of $\mathbf{a}_1$ and $\mathbf{a}_2$;
(3) update $\mathbf{S}_1^{(k)}$ and $\mathbf{S}_2^{(k)}$ and determine the Kalman gain vector using the previously obtained $\mathbf{Q}_m$, $m = 0, \ldots, T - 1$, and apply exponential weighting with $\lambda$;
(4) update $\mathbf{w}^{(k)}$.
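The split update can be illustrated with a short sketch. As with conventional RLS, the block below keeps one explicit correlation matrix per input group and solves it directly, so it is not the Givens-based square-root form of Algorithm 2; it only mirrors the structure "two independent Kalman gains, concatenated and scaled by a step size". The function name `split_rls`, the parameter values, the regularization `delta`, and the synthetic data are hypothetical.

```python
import numpy as np

def split_rls(U, d, T1, lam=0.99, mu=0.5, delta=1e-2):
    """Two-group split RLS for a linear combiner with output w^T u (non-square-root sketch)."""
    n_samples, T = U.shape
    Phi1 = delta * np.eye(T1, dtype=complex)        # correlation matrix of group 1
    Phi2 = delta * np.eye(T - T1, dtype=complex)    # correlation matrix of group 2
    w = np.zeros(T, dtype=complex)
    for k in range(n_samples):
        u = U[k]
        u1, u2 = u[:T1], u[T1:]
        Phi1 = lam**2 * Phi1 + np.outer(np.conj(u1), u1)
        Phi2 = lam**2 * Phi2 + np.outer(np.conj(u2), u2)
        # Update direction: concatenation of the two per-group Kalman gains.
        l = np.concatenate([np.linalg.solve(Phi1, np.conj(u1)),
                            np.linalg.solve(Phi2, np.conj(u2))])
        e = d[k] - w @ u                             # a priori error
        w = w + mu * l * e
    return w

# Synthetic, noiseless test problem with mildly correlated groups (assumed).
rng = np.random.default_rng(3)
T1, T2, n_samples = 4, 2, 6000
Z = rng.standard_normal((n_samples, T1 + T2)) + 1j * rng.standard_normal((n_samples, T1 + T2))
U = Z.copy()
U[:, T1:] += 0.3 * Z[:, :T2]                         # couple group 2 to group 1
w_true = rng.standard_normal(T1 + T2) + 1j * rng.standard_normal(T1 + T2)
d = U @ w_true

w_split = split_rls(U, d, T1)
split_error = np.linalg.norm(w_split - w_true) / np.linalg.norm(w_true)
```

The cross-correlation between the two groups is ignored by the update direction, which is why a step size is needed and why convergence slows down as the coupling factor (0.3 in this sketch) grows.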

SPLIT SR-RLS INITIALIZATION OF THE PTEQ/PTEC RECEIVER
In this section, we will apply the split SR-RLS algorithm for the initialization of the PTEQ/PTEC receiver structure. The PTEQ-only receiver [9] will be briefly reviewed in the first subsection and will be extended with PTEC in the second subsection [13].

Per-tone equalization
As mentioned in the introduction, the channel impulse response in the ADSL context typically exceeds the cyclic prefix length, thereby destroying subchannel orthogonality. The resulting ISI and ICI can be mitigated by means of a channel-shortening TEQ combined with a bank of one-tap FEQs [3,4,5,6,7]. An alternative equalization structure is based on PTEQ, which accomplishes the joint task of TEQ/FEQ independently for each subcarrier [8,9] and which is able to optimize the overall bit rate. In the following, the ADSL data model is mainly based on [9] and only the main results will be repeated here.
Mathematically, the received signal vector $\mathbf{y}^{(k)}$ is obtained from the transmitted data by IDFT modulation, cyclic prefix insertion, and filtering by the channel with additive noise, following the model of [9]. In this model, $\mathbf{h}$ is a row vector representing the overall channel (transmit and receive filters plus telephone wire), $\mathbf{n}^{(k)}$ is additive channel noise, $s = N + \nu$, and $T_{EQ}$ is the number of PTEQ taps per tone. The vector $\mathbf{X}^{(k)}$ contains the data symbol of interest, $X^{(k)}_{1:N}$, as well as the preceding and succeeding symbol. The data vector is first IDFT modulated (by means of the IDFT matrix $\mathcal{I}_N$) and afterwards a cyclic prefix is inserted, represented by $\mathbf{P}$. The matrices $\mathbf{0}_{(1)}$ and $\mathbf{0}_{(2)}$ are zero matrices of appropriate dimension [9], and the synchronization delay is a design parameter.
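After DFT demodulation (implemented by the DFT matrix $\mathbf{F}_N$), PTEQ of tone $i$ is accomplished by forming a linear combination of the $i$th DFT output, $Y_i^{(k)}$, with $T_{EQ} - 1$ real-valued difference terms of $\mathbf{y}^{(k)}$, denoted $\Delta\mathbf{y}^{(k)}$. The output of the per-tone equalizer for tone $i$ is the linear combination $\mathbf{v}_i^T\mathbf{u}_i^{(k)}$, where $\mathbf{v}_i$ is the equalizer for tone $i$, the input vector $\mathbf{u}_i^{(k)}$ collects $\Delta\mathbf{y}^{(k)}$ and $Y_i^{(k)}$, and $Y_i^{(k)}$ is computed with $\mathbf{F}_N(i,:)$, the $i$th row of $\mathbf{F}_N$. The MMSE solution for $\mathbf{v}_i$ is obtained as

$$\mathbf{v}_{i,\mathrm{MMSE}} = \arg\min_{\mathbf{v}_i} E\left\{\left|X_i^{(k)} - \mathbf{v}_i^T\mathbf{u}_i^{(k)}\right|^2\right\},$$

where $X_i^{(k)}$ is the QAM symbol of interest, transmitted on tone $i$. Note that $\mathbf{v}_i$ is a linear combiner and has to be initialized for each tone. The inputs $\mathbf{u}_i^{(k)}$ can be separated into two parts: (i) the elements of $\Delta\mathbf{y}^{(k)}$ are real-valued since they are formed out of a pre-FFT signal and hence are common for all subcarriers; (ii) $Y_i^{(k)}$ is complex-valued and tone dependent.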
The distinct nature of the inputs will be exploited when applying the split SR-RLS to the overall PTEQ/PTEC structure.

Joint per-tone echo cancellation and per-tone equalization
In ADSL, the available subchannels are assigned to either the upstream or downstream transmission direction, or to both. As transmission in both directions takes place over a single twisted pair, the transmitter and receiver at one end are coupled to the line by a hybrid. A perfectly balanced hybrid prevents leakage of transmitted signals into the receiver. However, due to large variations in the subscriber loops, a fixed hybrid is not able to exactly balance all possible loops and hence leakage or echo occurs. To allow efficient bidirectional communication over one twisted pair, echo cancellation is required to separate upstream and downstream channels. Due to the asymmetric character of ADSL transmission, a smaller bandwidth (25-138 kHz) is foreseen for the upstream direction compared to the downstream direction (25-1104 kHz), and echo cancellation makes it possible to share the low frequency portion of the available frequency band.
In this subsection, we will focus on the per-tone echo cancelers where the bank of per-tone equalizers is extended with a bank of per-tone echo cancelers [13]. The resulting echo cancellation is then completely done for each tone separately. For a given number of equalizer and echo canceler taps per tone, this approach is able to maximize the achievable bit rate [13].
An initialization formula has been derived in [13], based on an exact channel model and exact knowledge of the signal and noise statistics. This direct initialization results in a high computational cost. Hence, we will focus in this paper on adaptively initializing the joint PTEQ/PTEC structure.
When echo is present, the overall received signal vector $\mathbf{r}^{(k)}$ is obtained as

$$\mathbf{r}^{(k)} = \mathbf{y}^{(k)} + \mathbf{y}_E^{(k)},$$

where $\mathbf{y}_E^{(k)}$ is the received echo component, modeled in the same way as the far-end signal [13]. Here, the row vector $\mathbf{h}_E$ represents the overall echo channel and $\mathbf{U}^{(k)}$ are the transmitted echo symbols. Again, the matrices $\mathbf{0}_{(3)}$ and $\mathbf{0}_{(4)}$ are zero matrices of appropriate dimension [13]. Now, define the echo reference signal as $\mathbf{u}^{(k)}$, which contains a block of $T_{EC}$ cyclically prefixed, transmitted time domain echo samples. The exact position of this data block within the transmitted echo stream depends on the alignment between echo symbols with respect to far-end symbols, see [8,13] for more details. The output of the joint PTEQ/PTEC for tone $i$ is again a linear combination of its inputs, weighted by the $T_{EQ}$-tap equalizer $\mathbf{v}_i$ and the $T_{EC}$-tap echo canceler $\mathbf{v}_{E,i}$ for tone $i$. The inputs $\Delta\mathbf{r}^{(k)}$, $\Delta\mathbf{u}^{(k)}$, $R_i^{(k)}$, and $\tilde{U}_i^{(k)}$ are the $T_{EQ} - 1$ difference terms of the received signal, the $T_{EC} - 1$ difference terms of the echo reference signal, and the corresponding DFT outputs for tone $i$, respectively. The MMSE solution for $\mathbf{v}_i$ and $\mathbf{v}_{E,i}$ is obtained by jointly minimizing the mean square error between this output and the transmitted symbol $X_i^{(k)}$. Also here, the linear combiners, $\mathbf{v}_i$ and $\mathbf{v}_{E,i}$, have to be initialized for each tone $i$. The input vector has similar properties as in the PTEQ-only problem: (i) $\Delta\mathbf{r}^{(k)}$ and $\Delta\mathbf{u}^{(k)}$ are $(T_{EQ} - 1) + (T_{EC} - 1)$ real-valued difference terms which are common for all frequency bins; (ii) $R_i^{(k)}$ and $\tilde{U}_i^{(k)}$ are 2 complex-valued DFT outputs for each tone $i$.
By reordering the inputs, we are able to separate the common part and the per-tone part, that is,

$$\mathbf{u}_i^{(k)} = \begin{bmatrix}\Delta\mathbf{r}^{(k)T} & \Delta\mathbf{u}^{(k)T} & R_i^{(k)} & \tilde{U}_i^{(k)}\end{bmatrix}^T.$$

The straightforward application of SR-RLS, according to Algorithm 1, to initialize the PTEQ/PTEC coefficients will lead to a matrix $\mathbf{S}^{(k)} = \mathbf{S}_i^{(k)}$ that is different for each tone. However, due to the reordering of the inputs, the $T_{EQ} + T_{EC} - 2$ real difference terms, $\Delta\mathbf{r}^{(k)}$ and $\Delta\mathbf{u}^{(k)}$, give rise to a $(T_{EQ} + T_{EC} - 2) \times (T_{EQ} + T_{EC} - 2)$ real triangular part in $\mathbf{S}_i^{(k)}$ which is common for all the tones, similar to [23]. The FFT outputs are taken as the last inputs to the SR-RLS structure and make only the two last (bottom) rows of $\mathbf{S}_i^{(k)}$ tone dependent. Hence, full SR-RLS for PTEQ/PTEC initialization requires the update and the storage of a common lower triangular matrix of size $(T_{EQ} + T_{EC} - 2) \times (T_{EQ} + T_{EC} - 2)$ and 2 tone dependent rows of length $(T_{EQ} + T_{EC})$.
To avoid all the complexity and memory requirement of a full SR-RLS, the split SR-RLS (cf. Algorithm 2) can be applied with $T_1 = T_{EQ} - 1 + T_{EC} - 1$ and $T_2 = 2$. The matrix $\mathbf{S}_1^{(k)}$ will again be constructed based on $\Delta\mathbf{r}^{(k)}$ and $\Delta\mathbf{u}^{(k)}$ only and hence will be real-valued and common for all the carriers. The second matrix $\mathbf{S}_{2,i}^{(k)}$ is lower triangular of dimension $2 \times 2$, complex-valued, and tone dependent since it receives $R_i^{(k)}$ and $\tilde{U}_i^{(k)}$ as inputs. The resulting initialization algorithm is given in Algorithm 3 and depicted in Figure 1.
Figure 1 represents a signal flow graph (SFG) for the initialization of the PTEQ/PTEC receiver. The functionality of the building blocks is also explained and is based on [23]. The hexagons represent the computational complexity to update $\mathbf{S}_1^{(k)}$ and $\mathbf{S}_{2,i}^{(k)}$ by means of Givens rotations. Observe that $\mathbf{S}_1^{(k)}$ is common for all the tones and $\mathbf{S}_{2,i}^{(k)}$ has to be computed for each tone separately. Note that when considering only the first $T_{EQ} - 1$ difference terms and $R_i^{(k)}$ as inputs in Figure 1, we obtain an SFG for PTEQ initialization. A similar approach for PTEQ-only initialization was followed in [24,25], where a mixture of SR-RLS and LMS was applied instead of a split SR-RLS algorithm.
To see the benefits of the split SR-RLS scheme, we should compare the proposed scheme with the original SR-RLS initialization. When SR-RLS is applied for the PTEQ/PTEC initialization, the real-valued common matrix $\mathbf{S}_1^{(k)}$ in Algorithm 3 is equal to the common part of the full SR-RLS scheme. On the contrary, $\mathbf{S}_{2,i}^{(k)}$ is reduced to a $2 \times 2$ complex-valued lower triangular matrix per tone instead of a complex-valued $2 \times (T_{EQ} + T_{EC})$ matrix per tone with full SR-RLS.
Due to the asymmetric character of ADSL data transmission, the upstream signal (from customer to central office) will typically be generated and demodulated by an (I)DFT size which is $\kappa$ times smaller than the corresponding (I)DFT size for the downstream signal (from central office to customer). This has some implications on the complexity.
(i) In a typical downstream ADSL scenario (modem at the customer premises), the echo transmit IDFT (upstream signal) is $\kappa$ times smaller than the receive DFT size. Van Acker et al. showed that due to this asymmetry, the number of PTEC taps can be reduced by a factor $\kappa$ [8,13]. As a result, the split SR-RLS scheme is able to save $2 \cdot (2 \cdot (T_{EQ} + T_{EC}/\kappa - 2)) \cdot N_u$ memory places, where $N_u$ is the number of used tones and the additional factor 2 is due to the complex-valued elements. Also the corresponding computational complexity to update $\mathbf{S}_{2,i}^{(k)}$ is reduced with a similar factor. Typical values for downstream ADSL are $T_{EQ} = 16$, $T_{EC} = 200$, $\kappa = 8$, and $N_u = 223$.

(ii) In the upstream case (modem at the central office), where the echo transmit IDFT is $\kappa$ times larger than the receive DFT size, $\kappa$ DFTs are required for the PTEC [13]. As a result, $\mathbf{S}_{2,i}^{(k)}$ is of size $(\kappa + 1) \times (\kappa + 1)$ or $(\kappa + 1) \times (T_{EQ} + T_{EC})$ for the split SR-RLS or the original SR-RLS, respectively. Now, we gain approximately $2 \cdot ((\kappa + 1) \cdot (T_{EQ} + T_{EC} - \kappa - 1)) \cdot N_u$ memory places. Typical values for upstream ADSL are $T_{EQ} = 40$, $T_{EC} = 200$, $\kappa = 8$, and $N_u = 25$.
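Plugging the typical values into the two expressions above gives a feeling for the size of the savings. The helper names below are hypothetical; the formulas and parameter values are the ones quoted in the text, and the results are counted in real-valued memory places.

```python
def downstream_saving(T_EQ, T_EC, kappa, N_u):
    """Memory places saved per the downstream expression 2*(2*(T_EQ + T_EC/kappa - 2))*N_u."""
    return 2 * (2 * (T_EQ + T_EC // kappa - 2)) * N_u

def upstream_saving(T_EQ, T_EC, kappa, N_u):
    """Memory places saved per the upstream expression 2*((kappa+1)*(T_EQ+T_EC-kappa-1))*N_u."""
    return 2 * ((kappa + 1) * (T_EQ + T_EC - kappa - 1)) * N_u

down = downstream_saving(T_EQ=16, T_EC=200, kappa=8, N_u=223)   # 2*2*39*223 = 34788
up = upstream_saving(T_EQ=40, T_EC=200, kappa=8, N_u=25)        # 2*9*231*25 = 103950
```

For these values the downstream saving is 34788 memory places and the upstream saving is 103950 memory places.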

Similar applications
Finally, we want to mention briefly some other ADSL initialization problems where a similar split SR-RLS approach could be followed.
(i) In [26], a joint PTEQ and windowing receiver structure is described, which requires the initialization of $T$ coefficients for each tone. Here, narrow band radio frequency interference (RFI) is mitigated by adding a fixed window in front of the demodulating DFT. When, for example, a trapezoidal window is used, the split SR-RLS algorithm could be applied (similar to Section 4.2) with $T_1 = 2(T - 2)$ (tone independent) and $T_2 = 2$ (tone dependent) [26]. For a raised cosine window the following values are required: $T_1 = 2(T - 2)$ and $T_2 = 3$ [27].

(ii) In [28], PTEQ in combination with the mitigation of a dominant alien near-end crosstalker such as HDSL, SDSL, or HPNA was addressed. Again, initialization of $T$ coefficients with the split SR-RLS is possible with $T_1 = 2(T - 2)$ (tone independent) and $T_2 = 2$ (tone dependent).
For further details on these applications, we refer to the corresponding papers.

SIMULATION RESULTS
The split SR-RLS scheme will be demonstrated by ADSL simulations for the PTEQ/PTEC receiver structure. As a performance measure for the simulations, we will use the $\mathrm{SNR}_i$ for tone $i$ and the overall bit rate, which is computed from the number of bits $b_i$ assigned to tone $i$, the SNR gap $\Gamma$, the noise margin $\gamma_m$, and the coding gain $\gamma_c$. The SNR was calculated based on [9]. In our simulations the following values were used: $N = 512$, $\nu = 32$, $\Gamma = 9.8$ dB, $\gamma_m = 6$ dB, $\gamma_c = 3$ dB, and $F_s = 2.208$ MHz. Simulations were performed on CSA standard loops (see, e.g., [4]) with additive white Gaussian noise of −140 dBm/Hz and 24 DSL near-end crosstalk (NEXT) disturbers. For downstream transmission, the used tones range from 33 to 255, while upstream was simulated with tones 7 to 31.
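The relation between per-tone SNR and bit rate can be sketched with the widely used gap approximation. The expressions in the block below are an assumption: they are the common textbook form, $b_i = \log_2(1 + 10^{(\mathrm{SNR}_i - \Gamma - \gamma_m + \gamma_c)/10})$ with the bit rate equal to $F_s/(N + \nu)\sum_i b_i$, and may differ in detail (for example, rounding of $b_i$) from the formulas used for the reported results. The function names and the flat 40 dB SNR profile are hypothetical.

```python
import numpy as np

def bits_per_tone(snr_db, gap_db=9.8, margin_db=6.0, coding_gain_db=3.0):
    """Gap-approximation bit loading (assumed form); SNR, gap, margin, coding gain in dB."""
    effective_db = np.asarray(snr_db, dtype=float) - gap_db - margin_db + coding_gain_db
    return np.log2(1.0 + 10.0 ** (effective_db / 10.0))

def bit_rate(snr_db, fs=2.208e6, N=512, nu=32):
    """Overall bit rate in bit/s: DMT symbol rate fs/(N + nu) times the bits per symbol."""
    return fs / (N + nu) * np.sum(bits_per_tone(snr_db))

# Hypothetical example: 223 downstream tones (33 to 255), all at 40 dB SNR.
snr_profile_db = np.full(223, 40.0)
rate = bit_rate(snr_profile_db)
one_bit = bits_per_tone(12.8)        # effective SNR of 0 dB carries one bit
```

With the simulation parameters of this section, a flat 40 dB profile on the 223 downstream tones gives a rate of roughly 8 Mbit/s under this assumed formula.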
Figure 2 shows typical power spectral densities of the received far-end, echo, and channel noise signals before and after DFT demodulation for the CSA-1 loop. The tone spacing is 4.3125 kHz. In this scenario, the upstream signal is modulated by a 64-point IDFT which causes echo due to aliasing and DFT leakage at the downstream receiver (with a 512-point DFT, $\kappa = 8$). The PSDs on the transmitted upstream and downstream tones are −38 dBm/Hz and −40 dBm/Hz, respectively. The echo and far-end channels include the transmission loop together with all the transmit and receive front end filters. Although the tones are "separated" in frequency, one can clearly see that all the tones at the receiver are affected by echo. Hence, echo canceling on all subcarriers is required.
Figure 3 depicts the SNR evolution during convergence of the PTEQ/PTEC coefficients for the split SR-RLS scheme with $T_{EQ} = 16$ and $T_{EC}/\kappa = 25$. The simulation was again performed for a downstream CSA-1 loop. The training and echo sequence were constructed using 4-QAM modulation on all the tones. Notice that especially low and high tones have a relatively slow convergence due to the high ISI and ICI present in this region.
To illustrate the convergence rate of the split SR-RLS versus the original SR-RLS, simulations were performed on several CSA loops for PTEQ/PTEC initialization. Downstream and upstream bit rates as a function of the number of training symbols are depicted in Figures 4 and 5, respectively. In the simulations, a 64-point DFT and IDFT and a 512-point DFT and IDFT were used for upstream and downstream transmission, respectively. During the first $T_{EQ} + T_{EC}$ training symbols, the coefficients of $\mathbf{w}_i^{(k)}$ were not updated in order to initialize $\mathbf{S}_1$ and $\mathbf{S}_{2,i}$. The vector $\mathbf{w}_i^{(k)}$ was initialized with all zeroes and a one on the tap corresponding to $R_i^{(k)}$. The echo signal was asynchronous compared to the received far-end signal. For this design problem, we observe that the split SR-RLS converges approximately 10 times slower than full SR-RLS, which however still fits into the available ADSL training sequence.

CONCLUSIONS
In this paper, we have presented an efficient way to initialize the bank of per-tone equalizers and per-tone echo cancelers in a joint fashion. The proposed initialization algorithm is based on a modification of the full SR-RLS algorithm to obtain a convergence rate and complexity in between NLMS and full SR-RLS. We have shown that the method is convergent in the mean and provided an upper bound for the step size to be used. Finally, we briefly indicated how the presented algorithm could be applied to other DSL applications as well.

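APPENDICES

A. PROOF OF CONVERGENCE IN THE MEAN OF THE SPLIT SR-RLS
We start by proving that the convergence of the split SR-RLS algorithm is determined by the crosscorrelation matrix between the update direction $\mathbf{l}^{(k)}$ and the input vector $\mathbf{u}^{(k)}$, that is, by $\mathbf{X}_{lu} = E\{\mathbf{l}^{(k)}\mathbf{u}^{(k)T}\}$. Write the desired signal as $d^{(k)} = \mathbf{w}_0^T\mathbf{u}^{(k)} + n_0^{(k)}$, where $n_0^{(k)}$ is the estimation error when applying the optimal Wiener solution $\mathbf{w}_0$. Now, define the weight error, using (18), as

$$\boldsymbol{\epsilon}^{(k)} = \mathbf{w}^{(k)} - \mathbf{w}_0 = \mathbf{w}^{(k-1)} + \mu\,\mathbf{l}^{(k)}e^{(k-1)} - \mathbf{w}_0 = \left(\mathbf{I}_T - \mu\,\mathbf{l}^{(k)}\mathbf{u}^{(k)T}\right)\boldsymbol{\epsilon}^{(k-1)} + \mu\,\mathbf{l}^{(k)}n_0^{(k)}. \tag{A.4}$$

Taking the statistical expectation of (A.4), with the update direction written as $\mathbf{l}^{(k)} = \mathbf{T}^{(k)}\mathbf{u}^{(k)*}$, yields

$$E\left\{\boldsymbol{\epsilon}^{(k)}\right\} = E\left\{\left(\mathbf{I}_T - \mu\,\mathbf{l}^{(k)}\mathbf{u}^{(k)T}\right)\boldsymbol{\epsilon}^{(k-1)}\right\} + \mu\,\mathbf{T}^{(k)}E\left\{\mathbf{u}^{(k)*}n_0^{(k)}\right\}, \tag{A.5}$$

where we assumed that $\mathbf{T}^{(k)}$ becomes independent of the time index (which holds for stationary inputs and $\lambda < 1$). This relation will hold approximately for a slowly time varying $\mathbf{T}^{(k)}$ due to nonstationary inputs. Due to the orthogonality principle [14], the input vector $\mathbf{u}^{(k)}$ will be orthogonal to the estimation error when approaching the Wiener solution, which zeroes the second term in (A.5). According to the traditional "independence assumption" [14], standardly applied in LMS analyses, the input vector $\mathbf{u}^{(k)}$ is independent of $\boldsymbol{\epsilon}^{(k-1)}$. Hence, we may write

$$E\left\{\boldsymbol{\epsilon}^{(k)}\right\} = \left(\mathbf{I}_T - \mu\,\mathbf{X}_{lu}\right)E\left\{\boldsymbol{\epsilon}^{(k-1)}\right\}. \tag{A.6}$$

The unknowns $\mathbf{w}^{(k)}$ converge to the optimal Wiener solution $\mathbf{w}_0$ when $E\{\boldsymbol{\epsilon}^{(k)}\} = \mathbf{0}$ or $E\{\mathbf{w}^{(k)}\} = \mathbf{w}_0$. This occurs when all eigenmodes in (A.6) decrease in time, hence when every eigenvalue of $\mathbf{I}_T - \mu\,\mathbf{X}_{lu}$ has magnitude smaller than one.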

Figure 1: Signal flow graph of the split SR-RLS algorithm to initialize the joint PTEQ/PTEC problem.

Figure 2: Power spectral densities of received far-end signal, echo, and external noise before and after DFT demodulation for the CSA-1 standard loop.

Figure 3: Evolution of the downstream SNR (CSA 1) during convergence for the split SR-RLS scheme with $T_{EQ} = 16$, $T_{EC}/\kappa = 25$, $\lambda = 0.997$, and $\mu = 1$. The upper curve indicates the maximal achievable SNR obtained by the MMSE solution for $\mathbf{w}_i$.

Figure 5: Learning curves for the joint PTEQ and PTEC initialization using the original SR-RLS and split SR-RLS scheme. The curves are simulated for upstream CSA loops with $T_{EQ} = 40$, $T_{EC} = 200$, $\lambda = 0.999$, and $\mu = 1$.