Non-Cancellation Multistage Kurtosis Maximization with Prewhitening for Blind Source Separation

Chi et al. recently proposed two effective non-cancellation multistage (NCMS) blind source separation algorithms, one using the turbo source extraction algorithm (TSEA), called the NCMS-TSEA, and the other using the fast kurtosis maximization algorithm (FKMA), called the NCMS-FKMA. Their computational complexity and performance depend heavily on the dimension of the multisensor data, that is, the number of sensors. This paper proposes including prewhitening in the NCMS-TSEA and NCMS-FKMA prior to source extraction. This yields four improved algorithms, referred to as the PNCMS-TSEA, the PNCMS-FKMA, the PNCMS-TSEA(p), and the PNCMS-FKMA(p). Compared with the existing NCMS-TSEA and NCMS-FKMA, the former two algorithms achieve a significant reduction in computational complexity and some performance improvements. The latter two algorithms are generalized counterparts of the former two, with the single source extraction module replaced by a bank of source extraction modules operating in parallel at each stage. Although the PNCMS-TSEA and PNCMS-TSEA(p) (PNCMS-FKMA and PNCMS-FKMA(p)) perform the same, the merit of the parallel source extraction structure lies in its much shorter processing latency, which makes the PNCMS-TSEA(p) and PNCMS-FKMA(p) well suited to software and hardware implementations. Some simulation results are presented to verify the efficacy and computational efficiency of the proposed algorithms.


Introduction
Blind source separation (BSS) (or independent component analysis), the problem of extracting unknown sources only from observations over multiple sensors, has received wide attention in many areas such as array signal processing, wireless communications, and biomedical signal processing. A number of statistical BSS algorithms have been reported in the open literature, including algorithms using second-order statistics (SOS) (known as correlations) [1–3], algorithms using higher-order statistics (HOS) (known as cumulants) [4–18], and a variety of linear and nonlinear algorithms using principles such as the maximum-likelihood method and maximum entropy [4,5,19], or using characteristics and features of the source signals and the mixing matrix such as nonstationarity and nonnegativity or their combinations [20–23]. SOS-based algorithms such as the algorithm for multiple unknown signals extraction (AMUSE) proposed by Tong et al. [2,6], the second-order blind identification (SOBI) algorithm proposed by Belouchrani et al. [1], and the matrix-pencil approach proposed by Chang et al. [3] generally require the sources to be temporally colored and spatially uncorrelated with different power spectra, while HOS-based algorithms generally require the sources to be non-Gaussian, though their power spectra are allowed to be the same. The AMUSE and SOBI algorithms further require P > K (P is the number of sensors, K is the number of sources) and the noise correlation matrix to be given or estimated in advance, while the matrix-pencil approach requires P ≥ K instead of P > K and does not need the noise correlation matrix [3].
Among HOS-based BSS algorithms, the kurtosis (a fourth-order cumulant) maximization criterion has been regarded as an effective source separation criterion; examples include Hyvärinen and Oja's fast fixed-point algorithm (also called the FastICA) [11], Ding and Nguyen's kurtosis maximization algorithm (KMA) [15], Shalvi and Weinstein's superexponential algorithm (SEA) [13,14], and Chi and Chen's fast kurtosis maximization algorithm (FKMA) [16]. The FKMA also makes use of the SEA [17] and therefore shares its superexponential convergence rate, while guaranteeing convergence for finite data length and finite signal-to-noise ratio (SNR). However, the smaller the normalized kurtosis magnitude of the extracted source signal, the worse the performance of these algorithms. Chi and Peng recently proposed another BSS algorithm, called the turbo source extraction algorithm (TSEA) [24], whose performance is quite insensitive to the kurtosis magnitude of the extracted source signal. A number of BSS algorithms have been used in conjunction with a multistage successive cancellation (MSC) procedure to extract all the unknown sources one by one. This procedure, however, is susceptible to error propagation accumulated from stage to stage, which limits the quality of the source signals extracted at later stages. To circumvent this error propagation problem, Chi and Peng [24] further proposed a non-cancellation multistage (NCMS) framework for the FKMA and TSEA, named the NCMS-FKMA and NCMS-TSEA, respectively. It has been shown [24] that the NCMS-FKMA and NCMS-TSEA significantly outperform their MSC-based counterparts as well as some existing BSS algorithms which do not involve any successive deflation procedure. However, the computational complexity and performance of both the NCMS-TSEA and the NCMS-FKMA depend heavily on the dimension of the multisensor data (i.e., the number of sensors).
In this paper, prewhitening, which is effective for dimension reduction and noise reduction, is performed prior to source separation using the NCMS-TSEA and NCMS-FKMA, straightforwardly providing two computationally improved algorithms, referred to as the PNCMS-TSEA and the PNCMS-FKMA. The proposed PNCMS-TSEA and PNCMS-FKMA benefit not only from the prewhitening itself but also from an internal computational complexity reduction resulting from the fact that the mixing matrix is converted to a unitary mixing matrix. Specifically, the column vector estimation of the unknown mixing matrix and the computation of the orthogonal complement projection matrix required in the NCMS-TSEA and NCMS-FKMA can be substantially simplified in their PNCMS counterparts.
Since the NCMS-based algorithms require K stages to extract all of the K source signals, they yield a long processing latency when K is large. Aiming at reducing the processing latency, we further develop two algorithms, namely the PNCMS-TSEA(p) and PNCMS-FKMA(p), that are able to extract all the sources with a number of stages that is much smaller than K (typically less than 3 for K ≤ 10). The idea is to employ multiple source extraction modules at each stage for parallel source extraction while maintaining the same source extraction performance. Thanks to the unitary mixing matrix involved in the prewhitened data, we find that the column vectors of an arbitrary unitary matrix can practically serve as a set of initial conditions (spatial filters) for initializing iterative FKMAs or TSEAs operating in parallel. As a result, the use of the parallel TSEA (FKMA) source extraction modules in the PNCMS-TSEA(p) (PNCMS-FKMA(p)) can effectively extract most of the unknown sources at each and every stage. In fact, the PNCMS-TSEA(p) and PNCMS-FKMA(p) are, respectively, the generalized counterparts of the PNCMS-TSEA and PNCMS-FKMA, since the former two algorithms reduce to the latter two if they are artificially constrained to extract only one source signal at each stage. The performance and computational complexity improvements of the proposed PNCMS-TSEA and PNCMS-FKMA as well as the reduced processing latency of the PNCMS-TSEA(p) and PNCMS-FKMA(p) are verified by computer simulations.
The rest of this paper is organized as follows. In Section 2, we present the BSS problem and the general model assumptions used. In Section 3, the existing FKMA, TSEA, NCMS-TSEA, and NCMS-FKMA are briefly reviewed. After the presentation of the proposed PNCMS-TSEA and PNCMS-FKMA in Section 4.1, their parallel source extraction counterparts, the PNCMS-TSEA(p) and PNCMS-FKMA(p), are presented in Section 4.2. In Section 5, some simulation results are presented to demonstrate the effectiveness and computational complexity advantages of the proposed algorithms. Finally, some conclusions are drawn in Section 6.

Problem Statement and Assumptions
For ease of later use, let us define the notations shown in Table 1.
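Given a set of P sensor measurements, denoted by a P × 1 vector $x[n] = [x_1[n], x_2[n], \ldots, x_P[n]]^T$, based on the following instantaneous (or memoryless) multiple-input multiple-output (MIMO) model:

$$x[n] = A\,s[n] + w[n], \qquad (1)$$

where $A = [a_1, a_2, \ldots, a_K]$ is the unknown P × K mixing matrix, $s[n] = [s_1[n], s_2[n], \ldots, s_K[n]]^T$ is the vector of the K source signals, and $w[n]$ is the P × 1 noise vector. (A1) … (A2) … where $u_k[n]$ is a stationary, zero-mean, non-Gaussian, independent, and identically distributed (i.i.d.) process with a nonzero kurtosis given by [25]

$$C_4\{u_k[n]\} = E\{|u_k[n]|^4\} - 2\left(E\{|u_k[n]|^2\}\right)^2 - \left|E\{u_k^2[n]\}\right|^2.$$

The $u_k[n]$, k = 1, ..., K, are mutually statistically independent.
(A3) The noise $w[n]$ is zero-mean Gaussian and is statistically independent of $s[n]$.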
Let v be a P × 1 source extraction filter (a spatial filter) for processing the signal x[n]. Then the filter output is given by

$$e[n] = v^T x[n].$$

Source separation algorithms are essentially developed for designing a set of K source extraction filters, each of which extracts a distinct source signal. Next, let us briefly review the FKMA proposed by Chi and Chen [16,17], and the TSEA, the NCMS-FKMA, and the NCMS-TSEA proposed by Chi and Peng [24].
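The model and the spatial filtering above can be sketched numerically as follows. This is only a minimal illustration for real-valued data: the mixing matrix, the noise level, the seed, and the helper name `normalized_kurtosis` are assumptions made for the example, and the filter v is built from the true A (which is unknown in practice) purely to show what a successful extraction looks like.

```python
import numpy as np

def normalized_kurtosis(e):
    """Sample estimate of |C4{e}| / (E{e^2})^2 for a real sequence e."""
    e = e - e.mean()                      # enforce the zero-mean assumption
    m2 = np.mean(e**2)
    c4 = np.mean(e**4) - 3.0 * m2**2      # kurtosis (fourth-order cumulant), real case
    return abs(c4) / m2**2

rng = np.random.default_rng(0)
P, K, N = 5, 4, 2000                      # sensors, sources, samples (illustrative)
A = rng.standard_normal((P, K))           # hypothetical P x K mixing matrix
s = rng.choice([-1.0, 1.0], size=(K, N))  # binary +/-1 sources, one row per source
w = 0.1 * rng.standard_normal((P, N))     # white Gaussian sensor noise
x = A @ s + w                             # x[n] = A s[n] + w[n]

# Illustration only: a filter built from the (normally unknown) A so that v^T A = [1, 0, 0, 0].
v = np.linalg.pinv(A)[0]
e = v @ x                                 # e[n] = v^T x[n] = s_1[n] + v^T w[n]

print(normalized_kurtosis(s[0]))          # binary source: close to 2
print(normalized_kurtosis(w[0]))          # Gaussian noise: close to 0
```

The two printed values illustrate why non-Gaussianity matters for kurtosis-based extraction: a Gaussian component contributes essentially nothing to the criterion.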

Review of FKMA, TSEA, NCMS-FKMA, and NCMS-TSEA
3.1. FKMA [16]. The FKMA is an iterative algorithm for finding the optimum spatial filter v by maximizing the magnitude of the normalized kurtosis of e[n] [12,15]:

$$J(v) = J(e[n]) = \frac{\left|C_4\{e[n]\}\right|}{\left(E\{|e[n]|^2\}\right)^2}, \qquad (5)$$

where $C_4\{\cdot\}$ denotes kurtosis. With the assumptions (A1) and (A2) and the noise-free assumption, the optimum spatial filter v is able to extract one of the K source signals, that is, $e[n] = \alpha_k s_k[n]$ for some k ∈ {1, 2, ..., K}, where $\alpha_k$ is an unknown nonzero constant. At the ith iteration, this algorithm updates v by

$$v^{(i)} = \frac{R_x^{-1} d_{ex}^{(i-1)}}{\left\| R_x^{-1} d_{ex}^{(i-1)} \right\|}$$

(basically with superexponential convergence rate), where $e^{(i-1)}[n] = (v^{(i-1)})^T x[n]$, $R_x = E\{x[n]x^T[n]\}$, and $d_{ex}^{(i-1)} = \mathrm{cum}\{e^{(i-1)}[n], e^{(i-1)}[n], e^{(i-1)}[n], x[n]\}$. In case of $J(v^{(i)}) < J(v^{(i-1)})$, a gradient-type algorithm is used to update $v^{(i)}$ instead. This iterative algorithm stops when a preassigned convergence tolerance is reached. The computational load of the FKMA is determined by the total number of iterations spent and the dimension of x[n].

3.2. TSEA [24]. The TSEA is a cyclically iterative spatial-temporal processing algorithm which maximizes

$$J(v, g[n]) = J(\varepsilon[n]), \qquad \varepsilon[n] = g[n] * e[n] = g[n] * \left(v^T x[n]\right),$$

in which g[n] is a single-input single-output temporal filter of order equal to L, that is, g[n] ≠ 0, n = 0, ..., L. At the ith cycle, the TSEA consists of the following two steps.
(S1) Compute $y^{(i-1)}[n] = g^{(i-1)}[n] * x[n]$ followed by processing $y^{(i-1)}[n]$ using the FKMA(s) (with $v^{(i-1)}$ as the initial condition) to obtain the optimum spatial filter $v^{(i)}$ and the associated $\varepsilon[n]$. The superscript "(s)" in FKMA(s) indicates that the FKMA is used for the design of the spatial filter v.
(S2) … and then find the optimum temporal filter $g_i = [g^{(i)}[0], g^{(i)}[1], \ldots, g^{(i)}[L]]^T$ using the FKMA(t) (with $g_{i-1}$ as the initial condition) and obtain the associated $\varepsilon[n]$. …
If (S2) is removed at each cycle, the TSEA reduces to the FKMA. Therefore, the TSEA outperforms the FKMA. For extraction of all the unknown sources without involving any successive cancellation processing, the NCMS-TSEA and NCMS-FKMA have been proposed in [24] and are briefly reviewed in the next subsection.

3.3. NCMS-TSEA and NCMS-FKMA [24]. The NCMS-TSEA is a non-cancellation multistage source separation algorithm which, at stage ℓ, extracts a distinct source signal estimate, denoted by $e_\ell[n]$, and obtains the associated column vector estimate of A, denoted by $a_\ell$. Let $C_\ell = [a_1, a_2, \ldots, a_{\ell-1}]$ and let $C_\ell^{\perp}$ be a P × P projection matrix for which $R(C_\ell^{\perp})$ is orthogonal to $R(C_\ell)$. The NCMS-TSEA includes the following steps. …
(T3a) Source extraction with …, and $e_\ell[n] = \upsilon^T x_\ell[n]$ using the TSEA. Then estimate the associated column vector in the mixing matrix by the input-output cross-correlation (IOCC) method [16], that is, … and obtain …, and $e_\ell[n] = v^T x[n]$ using the TSEA (with $v_\ell$ and $g_\ell[n]$ obtained in (T3a) as the initial conditions for v and g[n], respectively).
The NCMS-FKMA is nothing but a special case of the NCMS-TSEA, where the TSEA is replaced by the FKMA in (T3a) and (T3b).The latter outperforms the former simply because the TSEA performs better than the FKMA.It should be mentioned that the NCMS-TSEA and NCMS-FKMA exhibit better performance than their counterparts involving MSC procedure because they are free from error propagation [24].
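To make the superexponential-type iteration used inside the FKMA (Section 3.1) more concrete, here is a hedged sketch for real-valued data. It is not the authors' implementation: the function name `sea_kurtosis_max`, the stopping rule, and the test data are assumptions for illustration, and the gradient-type fallback that the FKMA applies when J decreases is deliberately omitted.

```python
import numpy as np

def sea_kurtosis_max(x, v0, max_iter=50, tol=1e-8):
    """Fixed-point search for a unit-norm v maximizing the normalized
    kurtosis magnitude of e[n] = v^T x[n] (real data, rows of x are sensors).
    Simplification: no gradient fallback, unlike the FKMA."""
    N = x.shape[1]
    Rx = x @ x.T / N                              # sample correlation matrix
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        e = v @ x
        # cum{e, e, e, x} = E{e^3 x} - 3 E{e^2} E{e x} for zero-mean real data
        d = (x @ e**3) / N - 3.0 * np.mean(e**2) * (x @ e) / N
        v_new = np.linalg.solve(Rx, d)
        v_new /= np.linalg.norm(v_new)
        # the sign of v is irrelevant, so compare up to sign
        step = min(np.linalg.norm(v_new - v), np.linalg.norm(v_new + v))
        v = v_new
        if step < tol:
            break
    return v

# Hypothetical noise-free test case with an orthogonal (real unitary) mixing matrix.
rng = np.random.default_rng(1)
K, N = 4, 5000
A, _ = np.linalg.qr(rng.standard_normal((K, K)))
s = rng.choice([-1.0, 1.0], size=(K, N))
x = A @ s
v = sea_kurtosis_max(x, np.ones(K))
print(np.abs(A.T @ v))   # one entry near 1, the others near 0
```

With an orthogonal A the converged filter is close to one column of A up to sign, which is the behavior Lemma 1 later relies on.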

Improvements of NCMS-TSEA and NCMS-FKMA by Prewhitening
In this section, we present the PNCMS-TSEA and PNCMS-FKMA by incorporating prewhitening in the NCMS-TSEA and NCMS-FKMA, respectively. The prewhitening processing transforms the P × K mixing matrix A into a K × K unitary mixing matrix. As we will show in the first subsection, this simple preprocessing not only reduces the computational load of the source extraction that follows but also improves the source extraction performance. In the later subsections, we further present a parallel implementation counterpart of the PNCMS-TSEA (PNCMS-FKMA), referred to as the PNCMS-TSEA(p) (PNCMS-FKMA(p)), that is able to extract multiple unknown sources in parallel at each stage and thus significantly shortens the processing latency.

NCMS-TSEA with Prewhitening (PNCMS-TSEA).
In this subsection, let us present the PNCMS-TSEA and PNCMS-FKMA, which share the advantages of dimension reduction and noise reduction together with a further internal computational complexity reduction over the NCMS-TSEA and NCMS-FKMA. The idea is motivated by the following lemma.

Lemma 1. Suppose that the mixing matrix A in x[n] given by (1) is a unitary matrix (i.e., P = K and $AA^T = I_K$) and that the assumptions (A1), (A2) and the noise-free assumption hold. Then the optimum source extraction filter v obtained by finding a local maximum of J(v) (see (5)) is given by $v = \pm a_k$ for some k ∈ {1, 2, ..., K}, and the associated extracted source signal is $e[n] = \pm s_k[n]$.

The proof of Lemma 1 is presented in the Appendix. This lemma implies that when the mixing matrix A is a unitary matrix, the optimum source extraction filter v itself provides an estimate of the column vector $a_k$ associated with the extracted source $s_k[n]$, and thereby the estimation of $a_k$ in (T3a) of the NCMS-TSEA is no longer needed. Moreover, step (T2) in the NCMS-TSEA can be fulfilled without SVD. By Lemma 1, one can simply set

$$C_\ell = [v_1, v_2, \ldots, v_{\ell-1}],$$

which is a semiunitary matrix (i.e., $C_\ell^T C_\ell = I_{\ell-1}$), and thus the corresponding $C_\ell^{\perp}$ is given by [26]

$$C_\ell^{\perp} = I_K - C_\ell C_\ell^T.$$

Fortunately, the widely used prewhitening processing through eigenvalue decomposition of the correlation matrix $R_x = E\{x[n]x^T[n]\}$ for both dimension reduction and noise reduction of x[n] can transform the P × K mixing matrix A into a K × K unitary matrix [9,11]. With the above computational reduction, noise reduction, and the removal of column vector estimation and SVD in the NCMS-TSEA, the proposed PNCMS-TSEA is summarized as follows.
(P1) Prewhitening. Obtain the K × P prewhitening matrix W …, where $\lambda_1, \ldots, \lambda_K$ are the K largest eigenvalues of $R_x$ and $h_1, \ldots, h_K$ are the associated eigenvectors. The noise variance estimate $\sigma_w^2$ is obtained as the average of the other P − K eigenvalues of $R_x$. …
(R1) The proposed PNCMS-TSEA basically extracts all the sources through the same signal processing procedure as the NCMS-TSEA, except that (P3) for the former is much simpler than (T2) for the latter at each stage; in particular, the SVD is not needed in (P3) and no column vector estimation of the mixing matrix is involved in (P4). The inclusion of the prewhitening processing (P1) substantially reduces the computational complexity because the major processing involving the FKMA in (P4) depends on the dimension K of the prewhitened data instead of the dimension P (≥ K) of the original data, especially when P ≫ K. The extra computational load due to the prewhitening processing in (P1) is negligible compared with the reduction in the computational load of source extraction.
(R2) The performance of the proposed PNCMS-TSEA is basically similar to that of the NCMS-TSEA because the former can be thought of as a more efficient implementation based on the same source separation criterion (kurtosis maximization). However, some performance improvements of the proposed PNCMS-TSEA over the NCMS-TSEA can still be gained at low SNR because noise reduction is twofold: through the prewhitening processing and through source extraction in a lower-dimensional space.
(R3) With the TSEA used in (P4) replaced by the FKMA, the proposed PNCMS-TSEA reduces to the PNCMS-FKMA. The performance of the former is again superior to that of the latter, simply because the TSEA performs better than the FKMA.
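The prewhitening step (P1) and the SVD-free projection matrix can be sketched as below. Because the exact expression of W is not reproduced above, the sketch assumes the common noise-compensated form W = diag((λ_k − σ_w²)^(−1/2)) [h_1, ..., h_K]^T; the function names `prewhiten` and `complement_projector`, the data, and the noise level are likewise hypothetical.

```python
import numpy as np

def prewhiten(x, K):
    """Return (W, z) with z[n] = W x[n]; rows of x are sensors.
    Assumed form: W = diag((lam_k - sigma2)^(-1/2)) [h_1 ... h_K]^T, where
    sigma2 is the mean of the P-K smallest eigenvalues of Rx."""
    P, N = x.shape
    Rx = x @ x.T / N
    lam, H = np.linalg.eigh(Rx)               # eigenvalues in ascending order
    lam, H = lam[::-1], H[:, ::-1]            # reorder to descending
    sigma2 = lam[K:].mean() if P > K else 0.0
    scale = 1.0 / np.sqrt(np.maximum(lam[:K] - sigma2, 1e-12))
    W = scale[:, None] * H[:, :K].T           # K x P prewhitening matrix
    return W, W @ x

def complement_projector(C):
    """Orthogonal complement projector I - C C^T for a semiunitary C (C^T C = I)."""
    return np.eye(C.shape[0]) - C @ C.T

rng = np.random.default_rng(0)
P, K, N = 8, 4, 5000
A = rng.standard_normal((P, K))               # hypothetical mixing matrix
s = rng.choice([-1.0, 1.0], size=(K, N))      # unit-power sources
x = A @ s + 0.1 * rng.standard_normal((P, N))
W, z = prewhiten(x, K)
G = (W @ A) @ (W @ A).T                       # close to I_K if W A is nearly unitary

Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
C = Q[:, :2]                                  # two already-extracted unit filters
C_perp = complement_projector(C)              # no SVD required
```

The check on G illustrates the point of the lemma-based simplification: after prewhitening, the effective K × K mixing matrix W A is approximately unitary, so extracted filters can be stacked directly into C.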
Then the column vector $b_j$ represents the column vector in A that is closest to $v_j$. Hence, by maximizing J(v) with $v_j$ as the initial condition, the optimum spatial filter v may well converge to $b_j$, as implied by Lemma 1, with the associated output being the corresponding source signal up to a sign. Therefore, one can anticipate that the larger the number of distinct column vectors in B, the larger the number of distinct source signals that will be extracted by K TSEA (or FKMA) modules operating simultaneously. In order to get some idea about the average number of distinct column vectors in B, denoted as d(B), let us present some simulation results for a given K × K unitary matrix A for the cases of K = 2, 3, ..., 10. For each K × K mixing matrix A, $10^5$ unitary matrices V were randomly generated and the associated matrices B in (12) were calculated. Table 2 lists the obtained simulation results of d(B).
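The d(B) experiment can be reproduced in spirit with a short Monte Carlo sketch. Two assumptions are made here: "closest" is interpreted as the largest |a_k^T v_j| (distance up to sign between unit vectors), and real orthogonal matrices stand in for unitary ones. The helper names and the much smaller number of trials are illustrative choices, so the printed fractions are not the values of Table 2.

```python
import numpy as np

def random_orthogonal(K, rng):
    """Random K x K orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((K, K)))
    return Q * np.sign(np.diag(R))            # column sign fix for a uniform draw

def mean_distinct_columns(A, trials, rng):
    """Average number of distinct columns of A selected as nearest to the
    columns of randomly drawn orthogonal matrices V."""
    K = A.shape[1]
    total = 0
    for _ in range(trials):
        V = random_orthogonal(K, rng)
        nearest = np.argmax(np.abs(A.T @ V), axis=0)   # index k of a_k closest to each v_j
        total += np.unique(nearest).size
    return total / trials

rng = np.random.default_rng(0)
for K in (2, 4, 8):
    A = random_orthogonal(K, rng)
    d = mean_distinct_columns(A, trials=2000, rng=rng)
    print(K, d / K)                           # average fraction of distinct columns
```

For K = 2 every draw gives two distinct columns, since the two columns of a 2 × 2 orthogonal matrix have their largest entries in different rows.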
From Table 2, it can be observed that more than 80% of the columns of B are distinct on average for K ≤ 10, indicating that the use of the K columns of an arbitrary K × K unitary matrix to simultaneously initialize a bank of K TSEA (or FKMA) modules may well yield over 80% distinct sources extracted in the first stage, so that less than 20% of the unknown source signals are yet to be extracted at later stages. From these simulation results, one can expect that if at each stage a bank of TSEA (or FKMA) modules is used (each initialized by a column vector of an arbitrary unitary matrix) … source signals yet to be extracted. Let

$$C_\ell = \left[\, C_{\ell-1},\ \tilde{C}_{\ell-1} \,\right],$$

where $\tilde{C}_{\ell-1}$ contains the spatial filters that are obtained at stage ℓ − 1 and $C_{\ell-1}$ contains all the obtained spatial filters before stage ℓ − 1. As illustrated in Figure 1, for the projected data $x_\ell[n] = C_\ell^{\perp} x[n]$ at stage ℓ, the proposed PNCMS-TSEA(p) employs k(ℓ) parallel TSEA modules, each using a distinct column vector of an arbitrary K × K unitary matrix as the initial condition, to obtain k(ℓ) source signal estimates. Denote by $\upsilon_r$, r = 1, ..., k(ℓ), the k(ℓ) source extraction filters and let $v_r = C_\ell^{\perp} \upsilon_r$ (by (10)) for r = 1, ..., k(ℓ). Since the k(ℓ) extracted sources may not all be distinct, it is necessary to identify those source signal estimates and spatial filters associated with distinct sources. To this end, a source classification algorithm that follows the k(ℓ) TSEA modules, as shown in Figure 1, is presented next.

Source Signal Classification.
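According to Lemma 1, the obtained spatial filters $v_r$, r = 1, ..., k(ℓ), are also the estimates of the column vectors of the unknown unitary mixing matrix A. Consider the spatial filters $v_i$ and $v_j$ that maximize $J(e[n] = v^T x[n])$ with the respective outputs given by

$$e_i[n] = v_i^T x[n], \qquad e_j[n] = v_j^T x[n].$$

Under the assumptions (A1), (A2) and the noise-free assumption, the correlation of $e_i[n]$ and $e_j[n]$ is given by

$$E\{e_i[n]\,e_j[n]\} = v_i^T A\,\mathrm{diag}(\gamma_1, \ldots, \gamma_K)\,A^T v_j = \sum_{k=1}^{K} \gamma_k \left(v_i^T a_k\right)\left(a_k^T v_j\right), \qquad (18)$$

where $\gamma_k = E\{|s_k[n]|^2\}$, k = 1, ..., K, are the source signal powers. Since $v_i, v_j \in \{\pm a_1, \ldots, \pm a_K\}$ (see (13)) by Lemma 1 and all $a_k$ are orthonormal to each other, (18) implies that

$$\left|E\{e_i[n]\,e_j[n]\}\right| = \begin{cases} \gamma_k, & v_i = \pm v_j = \pm a_k, \\ 0, & \text{otherwise}, \end{cases} \qquad \left|v_i^T v_j\right| = \begin{cases} 1, & v_i = \pm v_j, \\ 0, & \text{otherwise}. \end{cases}$$

Therefore, the distinct extracted source signals can be identified according to the pairwise correlation of $v_1, \ldots, v_{k(\ell)}$.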
In other words, if $|(v_i)^T v_j| > \eta$ (a threshold between 0 and 1), $e_i[n]$ and $e_j[n]$ are actually estimates of the same source signal. In our experience, η = 0.5 is a good choice for signal classification. Based on the above analysis, we present a source signal classification algorithm in Table 3, where j(ℓ) represents the number of distinct source signals among the k(ℓ) extracted sources, and $(v_1, g_1[n]), (v_2, g_2[n]), \ldots, (v_{j(\ell)}, g_{j(\ell)}[n])$ are the associated spatial filters and temporal filters. Note that the computational load for source signal classification is negligible compared with that for source extraction. As illustrated in Figure 1, the PNCMS-TSEA(p) at stage ℓ ends up with the j(ℓ) unconstrained source extractions using j(ℓ) parallel TSEA modules with $(v_1, g_1[n]), (v_2, g_2[n]), \ldots, (v_{j(\ell)}, g_{j(\ell)}[n])$ as the initial conditions. The PNCMS-TSEA(p) then repeats the above processing procedures (namely, k(ℓ) parallel TSEA source extractions with the projected signal $x_\ell[n]$, source classification, and j(ℓ) parallel TSEA source extractions with the nonprojected signal x[n]) stage by stage until all the K source signals are extracted.
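The thresholding rule above can be sketched as a simple greedy pass over the extracted spatial filters. The algorithm of Table 3 is not reproduced in the text, so this grouping rule, the function name `classify_sources`, and the toy data are assumptions; temporal filters are ignored, as they would be in the FKMA-based variant.

```python
import numpy as np

def classify_sources(V, eta=0.5):
    """V holds unit-norm spatial filters as columns. A filter is kept only if
    |v_r^T v_q| <= eta for every filter q kept so far; the returned list
    contains the indices of the kept (distinct) filters."""
    kept = []
    for r in range(V.shape[1]):
        if all(abs(V[:, r] @ V[:, q]) <= eta for q in kept):
            kept.append(r)
    return kept

# Toy example: five filters, two of which duplicate earlier ones (up to sign).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V = np.column_stack([Q[:, 0], -Q[:, 0], Q[:, 2], Q[:, 1], Q[:, 2]])
V += 0.02 * rng.standard_normal(V.shape)      # small estimation errors
V /= np.linalg.norm(V, axis=0)
kept = classify_sources(V)                    # indices 0, 2, 3 survive
print(kept, len(kept))                        # len(kept) plays the role of j(l)
```

The cost is a handful of inner products of length K, which is consistent with the remark that classification is negligible next to source extraction.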
Five remarks about the proposed PNCMS-TSEA(p) are given as follows.
(R5) The proposed PNCMS-TSEA(p) basically extracts all the sources through the parallel NCMS signal processing and shares the same performance and computational complexity advantages of the PNCMS-TSEA due to the prewhitening processing as described in (R1) and (R2).
(R6) The proposed PNCMS-TSEA(p) reduces the processing latency in separation of all the K source signals since it can simultaneously extract as many distinct sources as possible at each stage in contrast to a single source extracted by the NCMS-TSEA/PNCMS-TSEA at each stage.Specifically, for the scenario within ten sources (K ≤ 10), we empirically found that all of the sources can be extracted within 3 stages.
(R7) With the TSEA used in (F4a) and (F4c) replaced by the FKMA and all the temporal filters $g_r[n]$ ignored in the source classification algorithm used in (F4b), the proposed PNCMS-TSEA(p) reduces to the PNCMS-FKMA(p), and again the former is superior to the latter owing to the superior performance of the TSEA.
(R8) The proposed PNCMS-TSEA (PNCMS-FKMA) can also be thought of as a special case of the proposed PNCMS-TSEA(p) (PNCMS-FKMA(p)). In particular, by setting k(ℓ) = 1 and i(ℓ) = ℓ − 1 at (F3), the PNCMS-TSEA(p) reduces to the PNCMS-TSEA presented in Section 4.1, that is, the bank of parallel source separation modules in Figure 1 is replaced by a single source separation module.
(R9) The proposed parallel NCMS structure is not limited to use with the FKMA and the TSEA. Instead, it can be used with any BSS algorithm which basically extracts one source at each stage, such as the KMA [13] and the SEA [12], to avoid the error propagation issue in multistage source deflation on the one hand, and to reduce the processing latency on the other hand. The FastICA [11], though extracting one source at each stage, cannot be directly applied to the proposed parallel NCMS structure because from the second stage (ℓ ≥ 2), the projected signal $x_\ell[n]$ involves a mixing matrix $C_\ell^{\perp} A$ which is no longer a unitary matrix as required by the FastICA. Therefore, the FastICA cannot be used unless further prewhitening processing is applied to $x_\ell[n]$.

Simulation Results
To justify the efficacy of the proposed BSS algorithms, namely the PNCMS-TSEA, PNCMS-FKMA, PNCMS-TSEA(p), and PNCMS-FKMA(p), three parts of simulation results are presented in this section. The first part considers the performance comparison of the proposed BSS algorithms with the existing NCMS-TSEA and NCMS-FKMA for the case of all the source signals having the same power spectrum. The second part focuses on the performance comparison of the proposed BSS algorithms with the existing FastICA [11], SOBI [1], and AMUSE [2] for the case of all the source signals having different power spectra. The third part then presents some results on the computational complexity and processing latency of the proposed four algorithms.

The i.i.d. $u_i[n]$'s used for generating the $s_i[n]$'s were equiprobable random binary sequences of ±1, and the fifth-order FIR models $b_i[n]$'s in [3] … were used for the generation of $s_i[n]$ in (2), where the parameter $\mu_i$ specifies the power spectrum of the source signal $s_i[n]$. The noise vector w[n] was real, zero-mean, spatially independent, and temporally white Gaussian with covariance matrix equal to $\sigma_w^2 I_P$. Three mixing matrices, denoted by $A_1$ (a 5 × 4 matrix taken from [3]), $A_2$ (an 8 × 4 matrix), and $A_3$ (an 8 × 6 matrix), were considered in the simulations as follows:

$$A_1 = \begin{bmatrix} 0.2380 & 0.2887 & -0.7120 & 0.4914 \\ 0.3397 & -0.7494 & -0.1157 & 0.2097 \\ 0.6107 & 0.4959 & 0.2661 & 0.2504 \\ 0.3558 & 0.2644 & -0.4216 & -0.6640 \\ -0.5731 & -0.1983 & -0.4807 & 0.4593 \end{bmatrix}, \quad \ldots$$

The synthetic data x[n] of length N = 2000 were generated according to (1) for different values of SNR and processed by each algorithm under test. The convergence criterion used for all the BSS algorithms under test was [24] … whenever the FKMA was employed, where $v^{(i)}$ denotes the spatial filter obtained at the ith iteration, and that for the TSEA was … where $\varepsilon_i^{(S1)}[n]$ and $\varepsilon_i^{(S2)}[n]$ were the ε[n] obtained in (S1) and (S2), respectively, at the ith cycle. Moreover, the initial conditions used for the spatial filter design were $v^{(0)} = 1_P/\sqrt{P}$ for the NCMS-TSEA as well as for the NCMS-FKMA and $v^{(0)} = 1_K/\sqrt{K}$ for the PNCMS-TSEA as well as for the PNCMS-FKMA, where $1_P$ denotes the P × 1 all-one vector. For the PNCMS-TSEA(p) and PNCMS-FKMA(p), the initial conditions for the parallel spatial filters at stage ℓ were randomly selected as k(ℓ) columns of an arbitrary K × K unitary matrix, and the threshold η = 0.5 was used for the source signal classification algorithm given in Table 3. Unless otherwise mentioned, the temporal filters in the TSEA modules were initialized by $g^{(0)}[n] = \delta[n]$ with L = 5 (the order of the temporal filter g[n]).
One hundred independent trials were conducted for performance evaluation of each BSS algorithm under test. Let $v_i$ and $\hat{s}_k[n]$ denote the optimum spatial filter and the associated source estimate at the ith simulation trial, respectively. The estimate $\hat{s}_k[n]$ can be expressed as

$$\hat{s}_k[n] = v_i^T x[n] = f_i^T s[n] + \epsilon_i[n],$$

where $\epsilon_i[n] = v_i^T w[n]$ and $f_i = A^T v_i$. The average output signal-to-interference-plus-noise ratio (SINR) associated with $\hat{s}_k[n]$ over the 100 independent trials was calculated as … where $f_{i,j}$ denotes the jth entry of the K × 1 vector $f_i$. The total averaged output SINR defined as … was used as the performance index of each BSS algorithm.
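As a rough illustration of how an output SINR can be computed from $f = A^T v$, consider the single-trial sketch below. The exact averaging formula used in the paper is not reproduced above, so this is an assumption-laden stand-in: it treats the source with the largest output power as the extracted one, assumes uncorrelated sources with powers `gamma` and white noise of variance `sigma2`, and uses a hypothetical function name.

```python
import numpy as np

def output_sinr_db(v, A, gamma, sigma2):
    """Single-trial output SINR (in dB) of e[n] = v^T x[n] for x = A s + w.
    Assumptions: uncorrelated sources with powers gamma, noise covariance
    sigma2 * I, and the dominant source counted as the desired signal."""
    f = A.T @ v                               # overall response f = A^T v
    p = gamma * f**2                          # output power contributed by each source
    k = int(np.argmax(p))                     # index of the extracted source
    interference = p.sum() - p[k]
    noise = sigma2 * (v @ v)                  # E{|v^T w[n]|^2}
    return k, 10.0 * np.log10(p[k] / (interference + noise))

# Toy check with an identity mixing matrix and a slightly contaminated filter.
A = np.eye(3)
k, sinr_db = output_sinr_db(np.array([1.0, 0.1, 0.0]), A, np.ones(3), 0.01)
print(k, sinr_db)                             # source 0, about 17 dB
```

Averaging such per-trial values over the 100 trials and over the K sources would give one plausible version of the total averaged output SINR used as the performance index.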
Part A: Performance Comparison for Sources with the Same Power Spectrum. In this part, we used mixing matrices $A_1$ (P = 5, K = 4) and $A_2$ (P = 8, K = 4), respectively, and set the parameters $\mu_i = 0.6$ in $b_i[n]$ given by (21) for i = 1, 2, 3, 4, in order to make all four source signals have the same power spectrum. Because SOS-based algorithms require different source power spectra, we only focused on the performance comparison between the existing NCMS-TSEA/FKMA and the proposed four algorithms. Figures 2 and 3 show the simulation results (output SINR versus SNR) for $A = A_1$ and $A = A_2$, respectively. One can observe from these two figures that the proposed PNCMS-TSEA(p), which has almost the same performance as the PNCMS-TSEA, significantly outperforms the proposed PNCMS-FKMA(p), and that the proposed PNCMS-FKMA(p) and PNCMS-FKMA have similar performance. Moreover, the proposed PNCMS-TSEA performs better than the NCMS-TSEA for SNR lower than 15 dB while performing the same for SNR higher than 15 dB, and the proposed PNCMS-FKMA outperforms the NCMS-FKMA, especially when $A = A_2$ as shown in Figure 3. These simulation results, which are consistent with (R2), (R3), (R5), and (R7), demonstrate the efficacy of the proposed BSS algorithms.

Part B: Performance Comparison for Sources with Different Power Spectra. In this part, we compare the proposed algorithms with the SOS-based algorithms, namely the AMUSE [2] and SOBI [1] algorithms, and the FastICA using kurtosis [11]. The mixing matrix $A = A_1$ (P = 5, K = 4) was used, and parameters (μ₁, μ₂, μ₃, μ₄) = (1, 0.4, 0. … where W is the K × P prewhitening matrix as presented in step (P1) in the proposed PNCMS-TSEA and U is a K × K unitary matrix. All the sources are extracted simultaneously without involving temporal processing as follows: … where $z[n] = Wx[n]$ is the prewhitened data vector.
The unitary matrix U obtained by the AMUSE is through eigendecomposition of the correlation matrix $R_z[\tau] = E\{z[n]z^T[n-\tau]\}$ of the prewhitened signal z[n] for a chosen τ [2], while that obtained by the SOBI algorithm is through joint diagonalization of a set of $R_z[\tau_j]$ [1]. The simulation results to be presented below were obtained with τ = 1 for the AMUSE and with $(\tau_1, \tau_2, \tau_3) = (1, 2, 3)$ for the SOBI algorithm.
Figure 4 shows the simulation results (output SINR versus SNR). It can be seen that all the observations about the performance of the proposed algorithms from Figures 2 and 3 also apply to Figure 4. One can also observe from this figure that the proposed PNCMS-TSEA(p) and PNCMS-TSEA perform best (with the highest output SINR) while the proposed PNCMS-FKMA(p) and PNCMS-FKMA perform second best, and all of them exhibit superior performance over the AMUSE and SOBI (×) algorithms. Moreover, the proposed PNCMS-FKMA and the FastICA exhibit similar performance, since both the PNCMS-FKMA and the FastICA [11] are NCMS kurtosis maximization algorithms with prewhitening processing performed prior to source extraction, except that the former employs the FKMA while the latter employs a gradient search algorithm (which may not be very computationally efficient) in source extraction.
Case A (number of iterations per source versus SNR). In this case, we calculated the average number of iterations (spent by the FKMA) required by the NCMS-TSEA/NCMS-FKMA and the PNCMS-TSEA/PNCMS-FKMA to extract a single source signal. Figures 5(a) and 5(b) show the simulation results (average number of iterations per source versus SNR) for mixing matrices $A = A_1$ and $A = A_2$, respectively. One can observe from these two figures that the computational complexity (in terms of the average number of iterations per source) of the proposed PNCMS-TSEA is smaller than that of the NCMS-TSEA, and that the difference for $A = A_2$ is much larger than that for $A = A_1$. Furthermore, the source extraction involved in the proposed PNCMS-TSEA was performed with prewhitened data of dimension K = 4, while that involved in the NCMS-TSEA was performed with the original data of dimension P = 5 for the case of $A = A_1$ and P = 8 for the case of $A = A_2$, indicating that the difference in computational complexity is substantial. These observations also apply to the computational complexity comparison between the PNCMS-FKMA and the NCMS-FKMA.
To show the effectiveness of the parallel source extraction structure of the PNCMS-TSEA(p)/PNCMS-FKMA(p) in reducing the processing latency, we calculated their average number of stages and the average total processing latency defined as

$$\text{Average total processing latency} = \frac{1}{100} \sum_{q=1}^{100} \sum_{\ell=1}^{F_q} I_q^{(\ell)}, \qquad (29)$$

where $F_q$ is the total number of stages required in trial q and $I_q^{(\ell)}$ stands for the maximum number of iterations (spent in extracting one source) among the k(ℓ) parallel TSEA/FKMA modules in (F4a) plus that in (F4c). For the PNCMS-TSEA/PNCMS-FKMA, the average processing latency was simply the total number of iterations spent in extracting all the source signals, averaged over 100 independent runs.
Case B (average number of stages versus SNR). The simulation results for the average number of stages spent by the proposed BSS algorithms are presented in Figure 6(a) for $A = A_1$ and in Figure 6(b) for $A = A_3$, respectively. One can observe from these two figures that the average number of stages of the proposed PNCMS-TSEA(p) is less than 2 for the case of $A = A_1$ and less than 3 for the case of $A = A_3$, in contrast to the 4 stages and 6 stages required by the NCMS-TSEA for $A = A_1$ and $A = A_3$, respectively. These observations also apply to the comparison between the PNCMS-FKMA(p) and the NCMS-FKMA.
Case C (average processing latency versus SNR). Figures 7(a) and 7(b) show the simulation results of the average processing latency (defined in (29)) for mixing matrices $A = A_1$ and $A = A_3$, respectively. It can be observed from these two figures that the average total processing latency of the proposed PNCMS-TSEA is much less than that of the NCMS-TSEA, and that the average total processing latency of the proposed PNCMS-TSEA(p) (•) is reduced further.

Conclusion
By incorporating the prewhitening processing (which has been widely used for noise reduction and dimension reduction) into the existing NCMS-TSEA and NCMS-FKMA, we have presented four improved non-cancellation multistage BSS algorithms: the PNCMS-TSEA, the PNCMS-FKMA, the PNCMS-TSEA(p), and the PNCMS-FKMA(p). In contrast to the NCMS-TSEA (NCMS-FKMA), we have shown that the proposed PNCMS-TSEA (PNCMS-FKMA) has much simpler processing in source extraction and in column vector estimation of the mixing matrix, with significant computational savings on the one hand and some performance improvements on the other, especially when the number of sensors is much larger than the number of sources. The proposed PNCMS-TSEA(p) (PNCMS-FKMA(p)) not only enjoys the same performance advantages as the PNCMS-TSEA (PNCMS-FKMA) but also has much smaller processing latency owing to the parallel source extraction structure at each stage. We have shown that the parallel TSEA (FKMA) modules in the PNCMS-TSEA(p) (PNCMS-FKMA(p)) can extract most of the unknown source signals at each stage by simply using columns of an arbitrary K × K unitary matrix as the initial conditions for the spatial filters of TSEA or FKMA. Our simulation results have demonstrated performance and computational complexity improvements by the PNCMS-TSEA and PNCMS-FKMA as well as substantial processing latency reduction by the PNCMS-TSEA(p) and PNCMS-FKMA(p). In particular, we empirically found that the PNCMS-TSEA(p) and PNCMS-FKMA(p) can extract all the source signals within 3 stages for K ≤ 10, with more than 30% reduction in the processing latency compared to the PNCMS-TSEA and PNCMS-FKMA, respectively.

Appendix
Proof of Lemma 1. With the assumptions (A1), (A2) and the noise-free assumption, it has been proven in [15] that J(e[n]) given by (5) attains a maximum (either locally or globally, regardless of whether A is unitary or not) if and only if condition (A.1) holds, namely that e[n] equals the kth source signal scaled by $\alpha_k$, where $\alpha_k$ is an unknown nonzero constant and the unknown integer $k \in \{1, 2, \ldots, K\}$. Note that the columns $a_i$, $i = 1, 2, \ldots, K$, of the K × K unitary matrix A form an orthonormal basis for R(A), that is, $\|a_i\| = 1$ and $a_i^T a_j = 0$ for all $i \neq j$. Therefore, it can be inferred from (A.1) and (A.2) that $v^T a_k = \alpha_k$ and $v^T a_i = 0$ for all $i \neq k$, and therefore it must be true that $v = \beta_k a_k$. Since $\|v\| = 1$, we have $\alpha_k = \beta_k = \pm 1$. Lemma 1 is then proved.
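The final step of the proof can be spelled out by expanding v in the orthonormal basis formed by the columns of A. The derivation below is a sketch that uses only the symbols of the proof; the general coefficients $\beta_i$ for $i \neq k$ are introduced here for illustration.

```latex
\begin{align*}
v &= \sum_{i=1}^{K} \beta_i a_i, \qquad \beta_i = v^T a_i, \\
v^T a_i &= 0 \ \ (i \neq k) \;\Longrightarrow\; v = \beta_k a_k, \qquad \beta_k = v^T a_k = \alpha_k, \\
\|v\|^2 &= \beta_k^2 \, \|a_k\|^2 = \beta_k^2 = 1 \;\Longrightarrow\; \alpha_k = \beta_k = \pm 1 .
\end{align*}
```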

(T1) Set $\ell = 0$, $C^{\perp}_1 = I_P$, and $x_1[n] = x[n]$.
(T2) Update $\ell$ by $\ell + 1$. If $\ell \geq 2$, obtain $C^{\perp}_{\ell}$ through singular value decomposition (SVD) of $C_{\ell}$, and compute the projected data $x_{\ell}[n] = C^{\perp}_{\ell} x[n]$ (which basically consists of all the contributions associated with the sources that have not yet been extracted).
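The projection in step (T2) can be sketched as follows. This is a minimal illustration that assumes the projection matrix is the P × P orthogonal projector onto the complement of the column space of the matrix of already-estimated columns; the function name, the tolerance, and the random test data are hypothetical.

```python
import numpy as np

def orthogonal_complement_projector(C, tol=1e-10):
    """Return the P x P projector onto the orthogonal complement of range(C)."""
    U, sing, _ = np.linalg.svd(C, full_matrices=True)
    rank = int(np.sum(sing > tol * sing.max()))
    U_null = U[:, rank:]            # left singular vectors orthogonal to range(C)
    return U_null @ U_null.T

rng = np.random.default_rng(1)
P = 5
C = rng.standard_normal((P, 2))     # hypothetical: two already-estimated columns
C_perp = orthogonal_complement_projector(C)
x = rng.standard_normal((P, 100))   # hypothetical sensor data, one column per n
x_proj = C_perp @ x                 # contributions along range(C) are removed
```

Applying the projector annihilates every component of the data that lies along the already-estimated columns, so only the contributions of the remaining sources survive.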

Step 4: Obtain the pair $(v_p, g_p[n]) \in S$, where $v_p = \arg\max_{v \in C_p} J(e[n] = v^T x_{\ell}[n])$.
Step 5: Repeat Step 2 to Step 4 until $\bigcup_{i=1}^{p} C_i = S_1$, where $j(\ell) = p$ is the total number of distinct sources.
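The selection in Step 4 can be sketched as picking, within one class of candidate spatial filters, the filter whose output maximizes the objective. The sketch assumes real-valued data and takes J to be the magnitude of the normalized kurtosis, which is an assumption made here since the definition in (5) is not restated; all names and test signals are hypothetical.

```python
import numpy as np

def normalized_kurtosis(e):
    """|E[e^4] - 3 E[e^2]^2| / E[e^2]^2 for a real sequence (mean removed)."""
    e = e - e.mean()
    power = np.mean(e ** 2)
    return abs(np.mean(e ** 4) - 3.0 * power ** 2) / power ** 2

def best_in_class(filters, x):
    """Pick the filter v in one class whose output v^T x[n] maximizes J."""
    scores = [normalized_kurtosis(v @ x) for v in filters]
    best = int(np.argmax(scores))
    return filters[best], scores[best]

rng = np.random.default_rng(2)
N = 20000
x = np.vstack([rng.choice([-1.0, 1.0], size=N),                 # binary source
               rng.uniform(-np.sqrt(3), np.sqrt(3), size=N)])   # uniform source
filters = [np.array([1.0, 0.0]), np.array([0.8, 0.6])]          # clean vs. mixed
v_best, score = best_in_class(filters, x)
```

In this toy case the first filter isolates the binary source, whose normalized kurtosis magnitude is larger than that of the mixture produced by the second filter, so the first filter is selected.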

Figure 2: Performance comparison results of Part A (output SINR versus SNR) for $A = A_1$.

Figure 4: Performance comparison results of Part B (output SINR versus SNR) for $A = A_1$.

Figure 5: Complexity comparison results of Case A in Part C (number of iterations per source versus SNR): (a) $A = A_1$ and (b) $A = A_2$.

Figure 6: Complexity comparison results of Case B in Part C (average number of stages versus SNR): (a) $A = A_1$ and (b) $A = A_3$.

Figure 7: Processing latency comparison results of Case C in Part C (average total processing latency versus SNR): (a) $A = A_1$ and (b) $A = A_3$.
… is large, a long processing latency is apparently inevitable. In order to reduce the processing latency, in this subsection we propose two more algorithms that use a parallel source extraction structure for the PNCMS-TSEA and PNCMS-FKMA, referred to as the PNCMS-TSEA(p) and PNCMS-FKMA(p), respectively. The PNCMS-TSEA(p) (PNCMS-FKMA(p)) basically has the same performance as the PNCMS-TSEA (PNCMS-FKMA), but at each stage the former uses multiple TSEA (FKMA) modules in parallel for multiple source extractions followed by source signal classification. The parallel source extraction will be shown to be effective in cutting down the total number of stages needed to extract all the source signals and thus in shortening the processing latency.

4.2.1. Initial Conditions for Parallel Source Extraction

Prior to presenting the proposed PNCMS-TSEA(p) in detail, let us first address how to efficiently extract as many distinct source signals as possible via the use of parallel TSEA modules at each stage. Since different initial conditions for the spatial filter v used to initialize either FKMA or TSEA may result in different source signals being extracted, proper initial conditions are needed for each TSEA module. Thanks to the prewhitening processing, which transforms the mixing matrix A into a unitary matrix, we empirically found that K distinct initial conditions taken from the K columns of an arbitrary K × K unitary matrix lead to a number of distinct extracted source signals that is a high fraction of K. To illustrate this, given a K × K unitary mixing matrix $A = [a_1, \ldots, a_K]$ we define a K × …

Table 2: Average number of distinct columns in B, d(B), for K = 2 to K = 10.

Table 3: Source signal classification algorithm.
3, 0.2) were set for $b_i[n]$, $i = 1, 2, 3, 4$, so that the four source signals would have different power spectra. Both the AMUSE and SOBI algorithms design the P × K source separation matrix (demixing matrix)