Separation of instantaneous mixtures of a particular set of dependent sources using classical ICA methods
EURASIP Journal on Advances in Signal Processing volume 2013, Article number: 62 (2013)
Abstract
This article deals with the problem of blind source separation in the case of a linear and instantaneous mixture. We first investigate the behavior of known independent component analysis (ICA) methods in the case where the independence assumption is violated: specific dependent sources are introduced and it is shown that, depending on the source vector, the separation may be successful or not. For sources which are a probability mixture of the previous dependent ones and of independent sources, we introduce an extended ICA model. More generally, depending on the value of a hidden latent process at the same time, the unknown components of the linear mixture are assumed either mutually independent or dependent. We propose for this model a separation method which combines: (i) a classical ICA separation performed using the set of samples whose components are conditionally independent, and (ii) a method for estimation of the latent process. The latter task is performed by iterative conditional estimation (ICE). It is an estimation technique in the case of incomplete data, which is particularly appealing because it requires only weak conditions.
1 Introduction
Over the last decades, blind source separation (BSS) has been an active research problem: this popularity comes from the wide range of potential applications such as audio processing, telecommunications, biology, etc. In the case of a linear multi-input/multi-output (MIMO) instantaneous system, BSS corresponds to independent component analysis (ICA), which is now a well-recognized concept [1]. Contrary to other frameworks where techniques take advantage of strong prior information on the diversity, for instance through the knowledge of the array manifold in antenna array processing, the core assumption in ICA is much milder and reduces to the statistical mutual independence between the inputs. However, the latter assumption is not mandatory in BSS. For instance, in the case of static mixtures, sources can be separated if they are only decorrelated, provided that their nonstationarity or their color can be exploited. Other properties such as the fact that sources belong to a finite alphabet can alternatively be utilized [2, 3] and do not require statistical independence. We consider in this article the case of dependent sources without assuming nonstationarity or color.
To the best of the authors’ knowledge, only a few references have tackled the issue of dependent source separation [4–15], although the interest in dependent sources has been witnessed by studies in various applied domains such as cosmology [6, 13, 14], biology/medicine [7, 8, 16], and feature extraction [17]. Among the interesting proposed extensions of ICA to dependent components, we should mention tree-dependent models [11] and models with dependence in variance profiles [12]. Contrary to the mentioned articles, our approach is based on the selection of an appropriately chosen subsample of the available data, which is then fed to a classical ICA method.
Among ICA or BSS methods, one can distinguish two approaches: some methods recover the sources one by one, which is what we refer to as multi-input/single-output (MISO) approaches. These approaches are often used in conjunction with a so-called deflation procedure [18, 19]. In contrast, other approaches, which will be referred to as MIMO, recover all the sources simultaneously.
Inspired by [20–23], we investigate in the first part of the article the behavior of the kurtosis contrast function: this criterion is well known in MISO BSS approaches and we study some of its properties in some specific cases of dependent sources.
In the second part of the article, we investigate a particular model which combines an ICA model with a probabilistic model on the sources, making them either dependent or independent at different time instants. Our method exploits the “independent part” of the source components. Although it is possible to refine our model by introducing a temporal dependence, it assumes neither nonstationarity nor color of the sources. We would like to point out the difference between our study and [17]: the latter assumes a conditional independence of the sources, whereas, depending on a hidden process, we assume either conditional independence or dependence. The proposed separation method relies on iterative conditional estimation (ICE), which has been introduced recently [24].
The considered model and notations are specified in Section 2. In Section 3, specific dependent sources are introduced and the behavior of the kurtosis contrast function is investigated. Then, Section 4 introduces a genuine model of dependent sources, for which separation is possible. The principles of our method and a discussion on ICE are provided in Section 5. The algorithm is precisely described in Section 6, where a parallel is also made with the accept-reject random generation method. Some simulations are provided in Section 7 and Section 8 concludes the authors’ study.
2 Mixture model
2.1 Linear mixture
We consider a set of T samples of vector observations. At each time instant t∈{1,…,T} the observed vector is denoted by x(t)≜(x _{1}(t),…,x _{ N }(t))^{T}. We assume that these observations result from a linear mixture of N unknown and unobserved source signals. More precisely, there exists a matrix $\mathbf{A}\in {\mathbb{R}}^{N\times N}$ and a vector-valued process s(t)≜(s _{1}(t),…,s _{ N }(t))^{T} such that:
$$\mathbf{x}(t)=\mathbf{A}\mathbf{s}(t).$$(1)
Let X≜(x(1),…,x(T)) be the N×T matrix with all samples of the observations and S≜(s(1),…,s(T)) be the N×T matrix with all source samples. The matrix A is unknown and the objective consists in recovering S from X only: this is the so-called blind source separation problem. We will assume here that A is a square left-invertible matrix and the problem thus reduces to the estimation of A or its inverse. A solution has long been available and is known as ICA [1]. It generally requires two assumptions: the source components should be non Gaussian (except possibly one of them) and they should be statistically mutually independent. With these assumptions, it is known that one can estimate a matrix $\mathbf{B}\in {\mathbb{R}}^{N\times N}$ such that y(t)=B x(t) restores the sources up to some ambiguities, namely ordering and scaling factors. In this article, with no loss of generality, we assume that the sources are zero-mean and have unit power. Finally, note that if A is a tall matrix (i.e., there are more observations than sources), a dimensionality reduction technique such as principal component analysis (PCA) can be used to obtain a mixture with as many observations as sources.
2.2 Notations
In the following, B denotes the estimated inverse of A and is referred to as the separating matrix. Defining $\mathbf{G}\triangleq \mathbf{BA}$ the combined mixing-separating matrix, the BSS problem is solved if G is a so-called trivial matrix, i.e., the product of a diagonal matrix with a permutation: these are well-known ambiguities of BSS.
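As a small illustration of the trivial-matrix condition, the following sketch checks whether a given G has exactly one nonzero entry in every row and every column, which is equivalent to being the product of an invertible diagonal matrix with a permutation. The function name `is_trivial` and the tolerance `tol` are illustrative assumptions and are not taken from the article.

```python
def is_trivial(G, tol=1e-9):
    """Return True if G (a list of rows) is a diagonal matrix times a permutation.

    Hypothetical helper: G is trivial when every row has exactly one entry
    whose magnitude exceeds `tol`, and these entries sit in distinct columns.
    """
    n = len(G)
    used_columns = set()
    for row in G:
        nonzero = [j for j, value in enumerate(row) if abs(value) > tol]
        if len(nonzero) != 1:
            return False
        used_columns.add(nonzero[0])
    # Distinct columns for the n rows means G is a scaled permutation.
    return len(used_columns) == n


# A separating solution: sources recovered up to order and scale.
print(is_trivial([[0.0, 2.0], [-1.0, 0.0]]))
# A non-separating solution: the first output still mixes both sources.
print(is_trivial([[1.0, 1.0], [0.0, 1.0]]))
```

In practice G is unknown since A is unknown; such a check is only usable in simulations where the mixing matrix is available.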
In Section 3, we will study separation criteria as functions of G. Source separation sometimes proceeds iteratively, extracting one source at a time (e.g., deflation approach). In this case (and particularly in Section 3), we will write y(t)=b x(t)=g s(t) where b and g=b A, respectively, correspond to a row of B and G and y(t) denotes the only output of the separating algorithm. In this case, the separation criteria are considered as functions of g.
Finally, Cum {.} denotes the cumulant of a set of random variables (see e.g., [1]) and Var{.} denotes the variance of a random variable. For a random vector $\mathbf{s}={({s}_{1},\dots ,{s}_{N})}^{\mathrm{T}}$ and for any multi-index i=(i _{1},…,i _{ N }), we introduce the notation:
$${\kappa}_{{i}_{1},\dots ,{i}_{N}}^{\left(\mathbf{s}\right)}\triangleq \text{Cum}\{\underbrace{{s}_{1},\dots ,{s}_{1}}_{{i}_{1}\ \text{times}},\dots ,\underbrace{{s}_{N},\dots ,{s}_{N}}_{{i}_{N}\ \text{times}}\}.$$
We denote by $\mathcal{N}(\mu ,{\sigma}^{2})$ the Gaussian law with mean μ and variance σ ^{2} and by $\mathcal{L}\left(\lambda \right)$ the Laplace (i.e., double-exponential) distribution with zero mean and scale parameter λ. The symbol ∼ denotes the law followed by a random variable or equality between probability distributions; the conditional distribution of a random variable (or vector) r knowing X under a parameter value θ is denoted by $\mathbb{P}(\mathbf{r}\mid \mathbf{X};\theta )$. Finally, we denote by δ(.) the Kronecker function which equals 1 if the argument in parentheses holds true and 0 otherwise.
3 A class of dependent sources
In this section, we introduce particular dependent sources based on products of independent signals. These models were shown to be useful when dealing with underdetermined [20] and nonlinear [23] mixtures. We assume that all processes are stationary. At each time instant t, the vectors s(t) and x(t) are realizations of random vectors. Since no confusion is possible, in this section only, we drop the time index t and these vectors are denoted, respectively, by s and x.
3.1 Three dependent sources
3.1.1 Specific sources and properties
Binary phase shift keying (BPSK) signals have specific features that will allow us to obtain source vectors with interesting properties. By definition, BPSK sources take values s=+1 or s=−1 with equal probability 1/2. We define the following source vector:
A1. Let ϵ be a BPSK random variable and a a real-valued non Gaussian random variable with nonzero fourth-order cumulant ${\kappa}_{4}^{\left(a\right)}\ne 0$. We assume also that a is independent of ϵ and $\mathbb{E}\left\{a\right\}=\mathbb{E}\left\{{a}^{3}\right\}=0$, $\mathbb{E}\left\{{a}^{2}\right\}=1$. Then, we define the source vector $\mathbf{s}\triangleq {({s}_{1},{s}_{2},{s}_{3})}^{\mathrm{T}}$ as follows:
$${s}_{1}=a,\quad {s}_{2}=\epsilon a,\quad {s}_{3}=\epsilon .$$
Interestingly, the following lemma holds true:
Lemma 1
The sources s _{1},s _{2},s _{3} defined by A1 are zero-mean, unit-variance, mutually dependent, decorrelated and their fourth-order cross-cumulant values are such that:
Proof
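From A1, s _{1}=a and s _{3}=ϵ have zero mean by definition and so does s _{2}=a ϵ by independence of a and ϵ. Hence $\mathbb{E}\left\{{s}_{1}\right\}=\mathbb{E}\left\{{s}_{2}\right\}=\mathbb{E}\left\{{s}_{3}\right\}=0$ and for such centered random variables, it is known that cumulants can be expressed in terms of moments as follows:
$$\text{Cum}\{{s}_{i},{s}_{j}\}=\mathbb{E}\{{s}_{i}{s}_{j}\}$$(2)
$$\text{Cum}\{{s}_{i},{s}_{j},{s}_{k},{s}_{l}\}=\mathbb{E}\{{s}_{i}{s}_{j}{s}_{k}{s}_{l}\}-\mathbb{E}\{{s}_{i}{s}_{j}\}\mathbb{E}\{{s}_{k}{s}_{l}\}-\mathbb{E}\{{s}_{i}{s}_{k}\}\mathbb{E}\{{s}_{j}{s}_{l}\}-\mathbb{E}\{{s}_{i}{s}_{l}\}\mathbb{E}\{{s}_{j}{s}_{k}\}$$(3)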
It is then possible to check all cases of Equations (2) and (3), using again A1. We obtain that (2) vanishes for i≠j and the decorrelation of the sources follows. The values of the fourthorder cumulants in the lemma are obtained similarly.
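On the other hand, the third-order cross-cumulant reads:
$$\text{Cum}\{{s}_{1},{s}_{2},{s}_{3}\}=\mathbb{E}\{{s}_{1}{s}_{2}{s}_{3}\}=\mathbb{E}\{{a}^{2}{\epsilon}^{2}\}=1\ne 0$$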
and this proves that s _{1},s _{2},s _{3} are mutually dependent. □
Depending on s _{1}=a, more can be proved about the source vector defined by A1. For example, if the probability distribution of a is symmetric, then s _{2} and s _{3} are independent. On the contrary s _{1} and s _{2} are generally not independent. An even more specific case is obtained when s _{1}=a is itself BPSK. Using in Lemma 1 the fact that, in the latter specific case ${s}_{2}^{2}={a}^{2}=1$, and calculating also the pairwise probability functions, we obtain the following result:
Lemma 2
Consider the source vector defined by A1. If in addition s _{1}=a is BPSK, the source vector s satisfies:

each component s _{ i } (i=1,2 or 3) is BPSK,

(s _{1},s _{2},s _{3}) are mutually dependent,

(s _{1},s _{2},s _{3}) are pairwise independent and hence decorrelated,

all fourth order cross cumulants of s vanish, that is:
$$\begin{array}{ll}{\kappa}_{4,0,0}^{\left(\mathbf{s}\right)}& ={\kappa}_{0,4,0}^{\left(\mathbf{s}\right)}={\kappa}_{0,0,4}^{\left(\mathbf{s}\right)}=-2\\ {\kappa}_{{i}_{1},{i}_{2},{i}_{3}}^{\left(\mathbf{s}\right)}& =0\quad \text{for any other}\quad {i}_{1}+{i}_{2}+{i}_{3}=4.\end{array}$$(4)
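The statements of Lemma 2 can be checked exactly by enumeration, since in this case (a, ϵ) takes only four equiprobable values. The sketch below is an illustration, not part of the original proof: the helper names `moment`, `cum4` and `pairwise_independent` are assumptions, and `cum4` uses the standard moment expression of the fourth-order cumulant of zero-mean variables.

```python
from itertools import combinations, combinations_with_replacement, product

# The four equiprobable outcomes of (a, eps), both BPSK and independent.
# Each outcome is the source vector s = (s1, s2, s3) = (a, eps * a, eps).
OUTCOMES = [(a, e * a, e) for a, e in product((-1, 1), repeat=2)]


def moment(*idx):
    """Exact moment E{s_i s_j ...} for the 0-based indices in idx."""
    total = 0
    for s in OUTCOMES:
        term = 1
        for n in idx:
            term *= s[n]
        total += term
    return total / len(OUTCOMES)


def cum4(i, j, k, l):
    """Fourth-order cumulant of zero-mean variables, written with moments."""
    return (moment(i, j, k, l)
            - moment(i, j) * moment(k, l)
            - moment(i, k) * moment(j, l)
            - moment(i, l) * moment(j, k))


def pairwise_independent():
    """Each pair (s_i, s_j) must hit its 4 sign patterns once, i.e. with probability 1/4."""
    patterns = sorted(product((-1, 1), repeat=2))
    return all(sorted((s[i], s[j]) for s in OUTCOMES) == patterns
               for i, j in combinations(range(3), 2))


# All fourth-order cumulants, indexed by sorted 0-based index tuples.
FOURTH = {idx: cum4(*idx) for idx in combinations_with_replacement(range(3), 4)}

print(pairwise_independent())   # pairwise independence
print(moment(0, 1, 2))          # third-order cross-moment, nonzero means dependence
print(FOURTH[(0, 0, 0, 0)])     # auto-cumulant of s1
```

The nonzero third-order cross-moment is what reveals the mutual dependence, even though every pair of sources is independent.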
3.1.2 Properties of the kurtosis contrast function with sources satisfying A1
Consider a source vector s which satisfies Assumption A1. If in addition s _{1}=a is BPSK, the arguments in [22] prove that a mixture of such a source vector can be separated by many classical ICA algorithms such as CoM2 [25], JADE [26], and FastICA [27]: this follows straightforwardly from Lemma 2 and the fact that the corresponding algorithms rely only on the vanishing of the cross-cumulants of the sources, that is on Equation (4). We now extend this result to more general distributions of a and state the following proposition^{a}:
Proposition 1
Let y=g s where the vector of sources is defined by A1. If ${\kappa}_{4}^{\left(a\right)}={\kappa}_{4,0,0}^{\left(\mathbf{s}\right)}\ne 2$, the function
$$\mathbf{g}\mapsto {\left|{\kappa}_{4}^{\left(y\right)}\right|}^{\alpha},\quad \alpha \ge 1$$(5)
defines a MISO contrast function, that is, its maximization over the set of unit norm vectors (∥ g∥^{2}=1) leads to a vector g with only one nonzero component. More precisely, depending on the fourth order cumulant ${\kappa}_{4}^{\left(a\right)}={\kappa}_{4,0,0}^{\left(\mathbf{s}\right)}$ of s _{1} :

If ${\kappa}_{4}^{\left(a\right)}>2$ , the maximization of ( 5) leads to extraction of either s _{1}=a or s _{3}=ϵ,

if $-2<{\kappa}_{4}^{\left(a\right)}<2$ , the maximization of ( 5) leads to extraction of s _{3}=ϵ,

if ${\kappa}_{4}^{\left(a\right)}=-2$ , the maximization of ( 5) leads to extraction of either one of the sources s _{1},s _{2} or s _{3},

if ${\kappa}_{4}^{\left(a\right)}=2$ , ( 5) does not define any contrast function and its maximization leads either to extraction of s _{3} or to a mixture of s _{1} and s _{2} only.
Remark 1
Since a is zero-mean and unit-variance, the fourth-order cumulant satisfies ${\kappa}_{4}^{\left(a\right)}\ge -2$. This follows from $\text{Var}\left\{{a}^{2}\right\}=\mathbb{E}\left\{{a}^{4}\right\}-{\mathbb{E}\left\{{a}^{2}\right\}}^{2}\ge 0$ and Equation (3). All cases are hence given in the above proposition.
Remark 2
The above result characterizes the global maximum of the criterion. However, one should remember that most optimization algorithms search for a local maximum only and may therefore fail to reach the global maximum. In simulations (Section 3.3), we did not observe convergence to any spurious local maximum. The same remark holds for Propositions 2, 3, and 4.
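Proof. First note that for all α≥1, the criterion in (5) reaches its maxima for the same values of g. We hence consider α=1 in the proof. Using y=g s, the multilinearity of the cumulants and Lemma 1 which holds for sources satisfying A1, we obtain:
$${\kappa}_{4}^{\left(y\right)}={\kappa}_{4}^{\left(a\right)}({g}_{1}^{4}+{g}_{2}^{4})+\left({\kappa}_{4}^{\left(a\right)}+2\right){g}_{1}^{2}{g}_{2}^{2}-2{g}_{3}^{4}.$$
Since the maxima of the criterion (5) under the constraint $\parallel \mathbf{g}{\parallel}^{2}={g}_{1}^{2}+{g}_{2}^{2}+{g}_{3}^{2}=1$ are either minima or maxima of ${\kappa}_{4}^{\left(y\right)}$ under the same constraint, we introduce the following Lagrangian:
$$\mathcal{L}={\kappa}_{4}^{\left(a\right)}({g}_{1}^{4}+{g}_{2}^{4})+\left({\kappa}_{4}^{\left(a\right)}+2\right){g}_{1}^{2}{g}_{2}^{2}-2{g}_{3}^{4}-\lambda ({g}_{1}^{2}+{g}_{2}^{2}+{g}_{3}^{2}-1).$$
The sought extrema necessarily satisfy:
$$\frac{\partial \mathcal{L}}{\partial {g}_{1}}=\frac{\partial \mathcal{L}}{\partial {g}_{2}}=\frac{\partial \mathcal{L}}{\partial {g}_{3}}=\frac{\partial \mathcal{L}}{\partial \lambda}=0.$$(7)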
The corresponding system is polynomial. Using well-known algebraic techniques now implemented in many computer algebra systems [28], one can get a system of triangular equations equivalent to (7), whose solutions can be given explicitly. For ${\kappa}_{4}^{\left(a\right)}\ne 2$, there are 26 solutions to (7), some of them being complex-valued depending on ${\kappa}_{4}^{\left(a\right)}$. In Table 1, we give the real-valued solutions to (7) and the corresponding value of ${\kappa}_{4}^{\left(y\right)}$. We also indicate for which values of ${\kappa}_{4}^{\left(a\right)}$ the solutions hold.
For all values of ${\kappa}_{4}^{\left(a\right)}\ne 2$ in Proposition 1 one can then consider all potential maxima in Table 1 and check which one maximizes $\left|{\kappa}_{4}^{\left(y\right)}\right|$.
For ${\kappa}_{4}^{\left(a\right)}=2$, we have ${\kappa}_{4}^{\left(y\right)}=2{({g}_{1}^{2}+{g}_{2}^{2})}^{2}-2{g}_{3}^{4}$ and, using again the Lagrangian, it can be checked by hand that the extrema of ${\kappa}_{4}^{\left(y\right)}$ subject to ∥g∥^{2}=1 satisfy either ${g}_{1}={g}_{2}=0,{g}_{3}^{2}=1$ or ${g}_{1}^{2}+{g}_{2}^{2}=1,{g}_{3}=0$. □
Amazingly, the above result does not hold any longer if one considers a mixture of only the first two components of the sources given by Assumption A1.
Proposition 2
Let y=gs where the vector of sources is given by the first two components s=(s _{1},s _{2})^{T} of the sources defined by A1. The function in Equation ( 5) satisfies:

If ${\kappa}_{4}^{\left(a\right)}<-\frac{2}{7}$ or ${\kappa}_{4}^{\left(a\right)}>2$ , it is a contrast function and its maximization leads to extraction of either s _{1} or s _{2} (that is: g=(±1,0) or g=(0,±1)),

if $-\frac{2}{7}\le {\kappa}_{4}^{\left(a\right)}\le 2$, it is not a contrast function and its maximization leads to a nonseparating solution of the type $\mathbf{g}=(\pm \frac{1}{\sqrt{2}},\pm \frac{1}{\sqrt{2}})$.
Proof
The proof is similar to the proof of Proposition 1. Indeed, we have ${\kappa}_{4}^{\left(y\right)}={\kappa}_{4}^{\left(a\right)}({g}_{1}^{4}+{g}_{2}^{4})+\left({\kappa}_{4}^{\left(a\right)}+2\right){g}_{1}^{2}{g}_{2}^{2}$ and the Lagrangian reads $\mathcal{L}={\kappa}_{4}^{\left(a\right)}({g}_{1}^{4}+{g}_{2}^{4})+\left({\kappa}_{4}^{\left(a\right)}+2\right){g}_{1}^{2}{g}_{2}^{2}-\lambda ({g}_{1}^{2}+{g}_{2}^{2}-1)$. Using a computer algebra system, one can then check that (7) is satisfied at 8 points. These points correspond to values of (g _{1},g _{2}) and ${\kappa}_{4}^{\left(y\right)}$ which are given in the first three rows of Table 1 (precisely those rows for which g _{3}=0). Then, (5) is a contrast function if and only if its maximization yields a separating solution, that is if and only if $\left|{\kappa}_{4}^{\left(a\right)}\right|>\left|\frac{3{\kappa}_{4}^{\left(a\right)}+2}{4}\right|$. The proposition then follows easily. □
3.2 Pairwise and mutual independence
3.2.1 Pairwise independent sources
We now investigate the particular case of pairwise independent sources and introduce the following source vector:
A2. s=(s _{1},s _{2},s _{3},s _{4})^{T} where s _{1},s _{2} and s _{3} are independent BPSK and s _{4}=s _{1} s _{2} s _{3}.
This case has been considered in [20], where it has been shown that
$$\text{Cum}\{{s}_{i},{s}_{i},{s}_{i},{s}_{i}\}=-2\quad \text{for all}\ i\in \{1,\dots ,4\},\qquad \text{Cum}\{{s}_{1},{s}_{2},{s}_{3},{s}_{4}\}=1$$(8)
and all other cross-cumulants vanish. The latter cumulant value shows that the sources are mutually dependent, although it can be shown that they are pairwise independent. It should be clear that pairwise independence is not equivalent to mutual independence but in an ICA context, it is relevant to recall the following property, which is a direct consequence of Darmois’ theorem ([25], p. 294):
Property 1
Let s be a random vector with mutually independent components, and x=G s. Then the mutual independence of the entries of x is equivalent to their pairwise independence.
Based on this property, the ICA algorithm in [25] searches for an output vector with pairwise independent components. Let us stress that this holds only if the source vector has mutually independent components: pairwise independence is indeed not sufficient to ensure identifiability, as we will see in the following section.
3.2.2 Pairwise independence is not sufficient
We first have the following preliminary result:
Lemma 3
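Let y=gs where the vector of sources is defined by A2 . Assume that the vector (s _{1},s _{2},s _{3}) takes all 2^{3} possible values. If the signal y has values in {−1,+1}, then g=(g _{1},g _{2},g _{3},g _{4}) is either one of the solutions below:
$$\mathbf{g}\in \left\{(\pm 1,0,0,0),\ (0,\pm 1,0,0),\ (0,0,\pm 1,0),\ (0,0,0,\pm 1)\right\}\cup \left\{\tfrac{1}{2}({e}_{1},{e}_{2},{e}_{3},{e}_{4}):\ {e}_{i}\in \{-1,+1\},\ {e}_{1}{e}_{2}{e}_{3}{e}_{4}=-1\right\}$$(9)
Proof. If y=g s, using the fact that ${s}_{i}^{2}=1$ for i=1,…,4, we have with the particular sources given by A2:
$${y}^{2}=({g}_{1}^{2}+{g}_{2}^{2}+{g}_{3}^{2}+{g}_{4}^{2})+2({g}_{1}{g}_{2}+{g}_{3}{g}_{4}){s}_{1}{s}_{2}+2({g}_{1}{g}_{3}+{g}_{2}{g}_{4}){s}_{1}{s}_{3}+2({g}_{2}{g}_{3}+{g}_{1}{g}_{4}){s}_{2}{s}_{3}.$$
Since (s _{1},s _{2},s _{3}) take all possible values in {−1,1}^{3}, we deduce from y ^{2}=1 that the following equations necessarily hold:
$$\begin{array}{l}{g}_{1}^{2}+{g}_{2}^{2}+{g}_{3}^{2}+{g}_{4}^{2}=1\\ {g}_{1}{g}_{2}+{g}_{3}{g}_{4}=0\\ {g}_{1}{g}_{3}+{g}_{2}{g}_{4}=0\\ {g}_{2}{g}_{3}+{g}_{1}{g}_{4}=0\end{array}$$(10)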
First observe that values given in (9) indeed satisfy (10). Yet, if a polynomial system of N equations of degree d in N variables admits a finite number of solutions^{b}, then there can be at most d ^{N} distinct solutions. Hence we have found them all in (9), since (9) provides us with 16 solutions for (g _{1},g _{2},g _{3},g _{4}). □
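The count of 16 solutions can be illustrated by brute force. The sketch below enumerates the 8 values of (s _{1},s _{2},s _{3}) under A2 and tests two candidate families for g: the signed unit vectors and the vectors whose four entries all equal ±1/2. Restricting the search to these two families is an assumption made for this illustration; the names `SAMPLES`, `output_is_binary`, `TRIVIAL`, `HALF` and `SOLUTIONS` are hypothetical.

```python
from itertools import product

# All 8 values of (s1, s2, s3), completed with s4 = s1 * s2 * s3 as in A2.
SAMPLES = [(s1, s2, s3, s1 * s2 * s3) for s1, s2, s3 in product((-1, 1), repeat=3)]


def output_is_binary(g, tol=1e-12):
    """True if y = g s takes values in {-1, +1} for every sample."""
    return all(abs(abs(sum(gi * si for gi, si in zip(g, s))) - 1.0) < tol
               for s in SAMPLES)


# Candidate family 1: signed unit vectors (separating solutions).
TRIVIAL = [tuple(sign if j == i else 0.0 for j in range(4))
           for i in range(4) for sign in (-1.0, 1.0)]
# Candidate family 2: all entries equal to +1/2 or -1/2 (non-separating).
HALF = [tuple(0.5 * e for e in signs) for signs in product((-1, 1), repeat=4)]

SOLUTIONS = [g for g in TRIVIAL + HALF if output_is_binary(g)]
print(len(SOLUTIONS))
```

Among the 16 vectors of the second family, only those with a negative product of entries pass the test, which gives 8 non-separating solutions in addition to the 8 separating ones.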
Using the above result, we are now able to specify the output of classical ICA algorithms when applied to a mixture of sources which satisfy A2.
Constant modulus and contrasts based on fourth order cumulants
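The constant modulus (CM) criterion is one of the best-known criteria for BSS. In the real-valued case, it simplifies to:
$${J}_{\text{CM}}(\mathbf{g})\triangleq \mathbb{E}\left\{{\left({y}^{2}-1\right)}^{2}\right\}$$(11)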
Proposition 3
For the sources given by A2 , the minimization of the constant modulus criterion with respect to g leads to either one of the solutions given by Equation ( 9).
Proof. We know that the minimum value of the constant modulus criterion is zero and that this value can be reached (for g having one entry being ±1 and other entries zero). Moreover, the vanishing of the constant modulus criterion implies that y ^{2}−1=0 almost surely and one can then apply Lemma 3. □
A connection can now be established with the fourth-order auto-cumulant if we impose the following constraint:
$$\mathbb{E}\{{y}^{2}\}=\parallel \mathbf{g}{\parallel}^{2}=1$$(12)
Because of the scaling ambiguity of BSS, the above normalization can be freely imposed. Under (12), we have ${\kappa}_{4}^{\left(y\right)}=\mathbb{E}\left\{{\left({y}^{2}-1\right)}^{2}\right\}-2$ and minimizing J _{CM}(g) thus amounts to maximizing $-{\kappa}_{4}^{\left(y\right)}$. Unfortunately, since ${\kappa}_{4}^{\left(y\right)}$ may be positive or negative, no simple relation between $\left|{\kappa}_{4}^{\left(y\right)}\right|$ and J _{CM}(g) can be deduced from the above equation. Recall that usual results on contrasts do not apply here since source s _{4} depends on the others ([1], pp. 83–85). However, we can state:
Proposition 4
Let y=g s where the vector of sources is defined by A2. Then, under the constraint ( 12) (∥g∥=1), we have:
(i) The maximization of $\mathbf{g}\mapsto -{\kappa}_{4}^{\left(y\right)}$ leads to either one of the solutions given by Equation ( 9).
(ii) $\left|{\kappa}_{4}^{\left(y\right)}\right|\le 2$ and the equality $\left|{\kappa}_{4}^{\left(y\right)}\right|=2$ holds true if and only if g is one of the solutions given in Equation ( 9).
Proof. Part (i) follows from the arguments given above. In addition, using multilinearity of the cumulants and (8), we have for y=g s:
$${\kappa}_{4}^{\left(y\right)}=-2({g}_{1}^{4}+{g}_{2}^{4}+{g}_{3}^{4}+{g}_{4}^{4})+24{g}_{1}{g}_{2}{g}_{3}{g}_{4}$$(13)
The result then follows straightforwardly from the study of the polynomial function in Equation (13). Indeed, optimizing (13) leads to the following Lagrangian:
$$\mathcal{L}=-2({g}_{1}^{4}+{g}_{2}^{4}+{g}_{3}^{4}+{g}_{4}^{4})+24{g}_{1}{g}_{2}{g}_{3}{g}_{4}-\lambda ({g}_{1}^{2}+{g}_{2}^{2}+{g}_{3}^{2}+{g}_{4}^{2}-1)$$(14)
After solving the polynomial system which cancels the Jacobian of the above expression, one can check that all solutions are such that $\left|{\kappa}_{4}^{\left(y\right)}\right|\le 2$. Part (ii) of the proposition easily follows. □
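The bound in part (ii) of Proposition 4 can be probed numerically. The sketch below computes the fourth-order cumulant of y=g s exactly, by enumerating the 8 equiprobable values of the sources under A2, and evaluates it at randomly drawn unit-norm vectors g. This is only a sanity check under the stated assumptions, not a proof; the names `kurtosis_of_output` and `random_unit_vector` and the number of trials are illustrative choices.

```python
import math
import random
from itertools import product

# The 8 equiprobable source vectors under A2: s4 = s1 * s2 * s3.
SAMPLES = [(s1, s2, s3, s1 * s2 * s3) for s1, s2, s3 in product((-1, 1), repeat=3)]


def kurtosis_of_output(g):
    """Exact fourth-order cumulant of y = g s: E{y^4} - 3 E{y^2}^2 (y is zero-mean)."""
    ys = [sum(gi * si for gi, si in zip(g, s)) for s in SAMPLES]
    m2 = sum(y ** 2 for y in ys) / len(ys)
    m4 = sum(y ** 4 for y in ys) / len(ys)
    return m4 - 3.0 * m2 ** 2


def random_unit_vector(rng, n=4):
    """Draw a vector uniformly on the unit sphere by normalizing Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
# Largest |kappa_4| observed over random unit-norm vectors g.
worst = max(abs(kurtosis_of_output(random_unit_vector(rng))) for _ in range(2000))
print(worst <= 2.0 + 1e-9)

# The bound is attained at a separating vector and at a non-separating one.
print(kurtosis_of_output((1.0, 0.0, 0.0, 0.0)))
print(kurtosis_of_output((0.5, 0.5, 0.5, -0.5)))
```

Both printed cumulant values equal the extreme value of the bound, which illustrates why a criterion based on the fourth-order cumulant alone cannot distinguish separating from non-separating solutions for these sources.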
3.3 Simulations
3.3.1 Context
We illustrate Propositions 1 and 2 with a few simulations. The random variable a in Assumption A1 has been generated as the following mixture of Gaussians:
$$a\sim \frac{1}{2}\mathcal{N}(\mu ,{\sigma}^{2})+\frac{1}{2}\mathcal{N}(-\mu ,{\sigma}^{2})$$(15)
μ can take any value in ]0,1[ and we have set σ ^{2}=1−μ ^{2}. The latter choice ensures that $\mathbb{E}\left\{{a}^{2}\right\}=1$, whereas we obviously have $\mathbb{E}\left\{a\right\}=\mathbb{E}\left\{{a}^{3}\right\}=0$. Finally, we have ${\kappa}_{4}^{\left(a\right)}=-2{\mu}^{4}$ and we choose the particular values μ=0.5 and μ=0.9, which correspond respectively to ${\kappa}_{4}^{\left(a\right)}=-0.125$ and ${\kappa}_{4}^{\left(a\right)}\approx -1.3122$.
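The two cumulant values quoted above can be reproduced from the moments of the mixture. The sketch below assumes that a is drawn from a symmetric two-component Gaussian mixture with means ±μ, equal weights and common variance 1−μ²; the function names `mog_kurtosis` and `sample_a` are hypothetical.

```python
import random


def mog_kurtosis(mu):
    """Exact fourth-order cumulant of a ~ 0.5 N(mu, s2) + 0.5 N(-mu, s2), s2 = 1 - mu^2."""
    s2 = 1.0 - mu ** 2
    m2 = mu ** 2 + s2                                   # second moment, equals 1
    m4 = mu ** 4 + 6.0 * mu ** 2 * s2 + 3.0 * s2 ** 2   # fourth moment
    return m4 - 3.0 * m2 ** 2


def sample_a(mu, rng):
    """Draw one sample of a: pick the mean +mu or -mu, then add Gaussian noise."""
    sign = rng.choice((-1.0, 1.0))
    return rng.gauss(sign * mu, (1.0 - mu ** 2) ** 0.5)


print(mog_kurtosis(0.5))
print(mog_kurtosis(0.9))
```

The closed form returned by `mog_kurtosis` simplifies to −2μ⁴, so that values of μ close to 1 give a cumulant close to the BPSK value.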
We generated mixtures of the sources given by Assumption A1: we mixed the three sources s _{1},s _{2},s _{3} or the two sources s _{1},s _{2} only with a matrix A randomly generated in ${\mathbb{R}}^{3\times 3}$ or ${\mathbb{R}}^{2\times 2}$, respectively.
3.3.2 Algorithm
We used the algorithm CoM2 described in [25] and ([1], Chap. 5). It relies on the following MIMO extension of criterion (5):
The algorithm in [25] first performs a prewhitening and the maximization of the above criterion is performed over the set of orthogonal matrices. From the results of Propositions 1 and 2 we expect the separation results which are given in Table 2, depending on ${\kappa}_{4}^{\left(a\right)}$ and the number of mixed sources. In Table 2, G is said to be separating when G=P D where P is a permutation matrix and D=diag(±1,…,±1), and G is said to extract s _{3} when $\mathbf{G}=\mathbf{P}\left(\begin{array}{ccc}\ast & \ast & 0\\ \ast & \ast & 0\\ 0& 0& 1\end{array}\right)$ and the elements denoted ∗ have absolute value $1/\sqrt{2}$.
3.3.3 Results
We provide some results illustrating Propositions 1 and 2 through the behavior of the algorithm CoM2. Let us define the following performance criterion:
We have 0≤τ _{ k }≤1 and τ _{ k } is close to zero whenever source k is well separated.
We performed 100 Monte Carlo runs both in the case where the three sources of A1 are mixed and in the case where only s _{1},s _{2} are mixed. The average results are provided in Tables 3 and 4. We see in Table 3 that τ _{3} is small in all cases, indicating that the source s _{3} is indeed separated by the algorithm. On the contrary, one can see in Tables 3 and 4 that τ _{1},τ _{2} are small only for μ=0.9: this corresponds to a value of ${\kappa}_{4}^{\left(a\right)}$ for which the obtained G is theoretically separating. For μ=0.5, corresponding to a value of ${\kappa}_{4}^{\left(a\right)}$ for which the obtained G does not theoretically separate s _{1} and s _{2}, the values of τ _{1} and τ _{2} are close to 0.5.
4 Extended ICA model
We have seen that, depending on the value of ${\kappa}_{4}^{\left(a\right)}$, the classical optimization criteria in Equations (5) and (16) are not contrasts any longer for the first two sources given by Assumption A1. We now introduce a new statistical model of dependent sources, which consists in a probability mixture of sources. One component of the probability mixture satisfies the requirement of ICA, whereas the other component of the probability mixture is dependent. As an interesting example of dependent sources, we will consider the first two sources defined by A1, where a is the mixture of Gaussians proposed in Section 3.3 with μ=0.5: this choice is justified by the previous results, which state that such sources cannot be separated by classical algorithms. We show that for our model, the separation is possible based on ICA and on the subset of samples where ICA assumptions are satisfied.
4.1 Latent variables
We first extend ICA methods and relax the independence assumption. The basic idea consists in introducing a hidden process r(t) such that, depending on the particular value of r(t) at instant t, the independence assumption is relaxed at time t. In this article, we will assume that r(t) can take two values only in the set {0,1}. Let r≜(r(1),…,r(T)). We assume more precisely:
A3. Conditionally on r, the components s(1),…,s(T) of S at different times are independent and for all t∈{1,…,T} we have $\mathbb{P}(\mathbf{s}(t)\mid \mathbf{r})=\mathbb{P}(\mathbf{s}(t)\mid r(t))$.
A4. Conditionally on r(t), when r(t)=0, the components of s(t) are mutually independent and non Gaussian, except possibly one of them;
A5. Conditionally on r(t), when r(t)=1, the components of s(t) are dependent.
One can see that, conditionally on r(t)=0, the source components s(t) at time t satisfy the usual assumptions required by ICA. In a BSS context, if r were known, one could easily apply ICA techniques by discarding the time instants where the sources do not satisfy the ICA assumptions. To be more precise, let us define:
$${\mathcal{I}}_{0}\triangleq \{t\in \{1,\dots ,T\}\mid r(t)=0\},\qquad {\mathbf{X}}_{0}\triangleq {\left(\mathbf{x}(t)\right)}_{t\in {\mathcal{I}}_{0}}$$(18)
The set ${\mathcal{I}}_{0}$ is the set of time instants where the components of s(t) are independent and non Gaussian, except possibly one of them. Then the subset X _{0} of the whole set X of the observations satisfies the assumptions usually required by ICA techniques. The core idea of our method consists in performing alternatively and iteratively an estimation of B (corresponding to A ^{−1}) and of the hidden data r.
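When r is known (or has been estimated), building X _{0} amounts to keeping the columns of X whose time index belongs to the set of instants with r(t)=0. The sketch below shows this selection on plain Python lists; storing X as a list of N rows of length T and the function name `select_independent_samples` are assumptions made for illustration.

```python
def select_independent_samples(X, r):
    """Keep the observation columns x(t) for which the latent value r(t) is 0.

    X is a list of N rows, each of length T; r is a list of T values in {0, 1}.
    Returns the reduced matrix X0 and the list of retained time indices.
    """
    kept = [t for t, rt in enumerate(r) if rt == 0]
    X0 = [[row[t] for t in kept] for row in X]
    return X0, kept


# Toy usage with N = 2 observations and T = 4 time instants.
X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
r = [0, 1, 0, 1]
X0, kept = select_independent_samples(X, r)
print(kept)
print(X0)
```

Any classical ICA algorithm can then be run on X0 alone; since the mixing matrix is the same at all time instants, the separating matrix obtained this way applies to the whole of X.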
4.2 Typical sources separated by the proposed method
The sources that we consider satisfy Assumptions A3., A4., and A5. given previously in Section 4.1. We now detail our particular choices for A4., A5., which have been used in simulations and in the sequel to illustrate the assumptions. These particular choices are denoted hereunder by A4.’ and A5.’ or A5.”.
We have considered two particular examples with N = 2 sources. In both cases, Assumption A4. particularizes to:
A4.’ When r(t)=0, the components of s(t) are mutually independent, uniformly distributed on $[-\sqrt{3},\sqrt{3}]$ (that is: zero-mean and unit-variance).
4.2.1 Example 1
As a first example, we particularize Assumption A5. as follows:
A5.’ When r(t)=1, then s _{1}(t)=a(t) and s _{2}(t)=ϵ(t)a(t), where ϵ(t) is an independent BPSK random variable and a(t) is an independent zero-mean, unit-variance random variable.
In simulations, we choose a(t) as a mixture of Gaussians whose distribution is given in Equation (15). A typical realization of a distribution satisfying A3., A4.’ and A5.’ is illustrated by the simulated values shown in Figure 1a. Conditionally on r(t)=1, the vector (s _{1}(t),s _{2}(t))^{T} corresponds to the first two sources in Assumption A1, which means that for r(t)=1, (s _{1}(t),s _{2}(t))^{T} lies on one of the two bisectors of the horizontal and vertical axes. According to the discussion in Sections 3.1.2 and 3.3, for ${\kappa}_{4}^{\left(a\right)}\in [-\frac{2}{7},2]$, algorithms based on the kurtosis contrast functions (such as CoM2) should not separate any linear mixture of (s _{1}(t),s _{2}(t))^{T} based on the samples where r(t)=1.
On the contrary, considering the samples X _{0} only amounts to removing the set of dependent points that lie on the two bisectors of Figure 2a. In such a case, the remaining samples in X _{0} satisfy the usual requirement for ICA and any ICA algorithm should succeed in separating a linear mixture of the sources.
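To make Example 1 concrete, the following sketch draws samples of (r(t), s _{1}(t), s _{2}(t)). It assumes that the r(t) are i.i.d. Bernoulli variables with a hypothetical probability `p_dep` of being equal to 1, and that a(t) is a symmetric two-component Gaussian mixture with means ±μ and variance 1−μ²; neither the name `sample_example1` nor the default parameter values come from the article.

```python
import math
import random


def sample_example1(T, p_dep=0.5, mu=0.5, seed=0):
    """Draw T samples of (r, s1, s2) for a model of the Example 1 type.

    r(t) = 0: s1, s2 independent and uniform on [-sqrt(3), sqrt(3)].
    r(t) = 1: s1 = a, s2 = eps * a with eps BPSK and a a Gaussian mixture.
    The i.i.d. Bernoulli model for r(t) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 - mu ** 2)
    bound = math.sqrt(3.0)
    r, s1, s2 = [], [], []
    for _ in range(T):
        rt = 1 if rng.random() < p_dep else 0
        if rt == 0:
            u = rng.uniform(-bound, bound)
            v = rng.uniform(-bound, bound)
        else:
            a = rng.gauss(rng.choice((-mu, mu)), sigma)
            eps = rng.choice((-1.0, 1.0))
            u, v = a, eps * a
        r.append(rt)
        s1.append(u)
        s2.append(v)
    return r, s1, s2


r, s1, s2 = sample_example1(1000)
# Dependent samples lie on the bisectors: |s1| equals |s2| whenever r(t) = 1.
print(all(abs(u) == abs(v) for rt, u, v in zip(r, s1, s2) if rt == 1))
```

A scatter plot of s2 against s1 produced from such samples shows a uniform square with the two bisectors superimposed.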
4.2.2 Example 2
The previous example is an extreme case where, conditionally on r(t)=1, either s _{1}(t)=s _{2}(t) or s _{1}(t)=−s _{2}(t). Thus, the set of samples where s _{1}(t) and s _{2}(t) are dependent lies on the two bisectors of the horizontal and vertical axes. We consider in this example the case where, conditionally on r(t)=1, the components of (s _{1}(t),s _{2}(t))^{T} are dependent but have a continuous joint density. We particularize Assumption A5. as follows:
A5.” Let ${u}_{1}\sim \mathcal{N}(0,{\sigma}_{\lambda}^{2})$ and ${u}_{2}\sim \mathcal{L}\left(\lambda \right)$ be independent random variables where λ is a positive parameter and ${\sigma}_{\lambda}^{2}=2(1-\frac{1}{{\lambda}^{2}})$. When r(t)=1, $\mathbb{P}(\mathbf{s}(t)\mid r(t)=1)$ is such that:
To say it differently, the two components of
are independent, the first one being Gaussian and the second one being Laplace distributed. One can verify that the choice of ${\sigma}_{\lambda}^{2}$ ensures that, conditionally on r(t)=1, the sources are unit-variance. Such a distribution density is illustrated by simulated values in Figure 2a with λ=5 and in Figure 2b with $\lambda =\sqrt{2}$. Considering X _{0} only amounts to removing the cloud of dependent points from the distributions. Visually, this seems much more difficult in Figure 2b than in Figure 2a, and much more difficult in this example than in Example 1 in Figure 1a. We will discuss further in a later section the influence of a good knowledge of r: surprisingly, it is not necessarily crucial in our method to have a good knowledge of r.
5 Separation method for the extended ICA model
5.1 Complete and incomplete data
Let us denote by θ=(B,η) the set of parameters to be estimated from the data: in this notation, we stress that θ consists of the separating matrix B and of the parameter vector η which characterizes the distribution of r. Let us call (r,X) the set of complete data, whereas X alone is the set of incomplete data: since r is a hidden process, the model described in Section 4 corresponds to the situation where only incomplete data is available for estimation of the sought parameters θ. Note that the adjective blind is used to emphasize that S is unavailable, whereas incomplete emphasizes that r is unavailable.
5.2 Iterative conditional estimation
Iterative conditional estimation (ICE) is an iterative estimation method that applies in the context of incomplete data and that was originally proposed for the problem of image segmentation [24, 29, 30].
Another well-known iterative estimation technique is the expectation-maximization (EM) algorithm, which is based on the maximum likelihood estimator. Contrary to EM, the underlying estimator in ICE can be of any kind. This makes ICE more widely applicable in cases where the likelihood computation or maximization becomes intractable [29, 31], for example in the non-Gaussian case. In the case where the maximum likelihood is chosen as the underlying estimator, ICE shows similarities with EM (see e.g., [32] for an experimental comparative study). It has been proven in [33] that in the case of distributions belonging to the exponential family, EM is one particular case of ICE. Finally, the interest of ICE and its convergence in the case of data that are probability mixtures have been shown in [34].
We now briefly describe ICE. The prerequisites in order to apply ICE are the following:

there exists an estimator from complete data $\widehat{\theta}(\mathbf{r},\mathbf{X})$,

one is able either to calculate $\mathbb{E}\{\widehat{\theta}(\mathbf{r},\mathbf{X})\mid\mathbf{X};\theta\}$ or to draw random variables according to $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\theta)$.
Starting from an initial guess of the parameters, the method consists in finding iteratively a sequence of estimates of θ, where each estimate is based on the previous one. More precisely, if $\widehat{\theta}^{[0]}$ is the first guess, the sequence of ICE estimates is defined by:
$$\widehat{\theta}^{[q]}=\mathbb{E}\{\widehat{\theta}(\mathbf{r},\mathbf{X})\mid\mathbf{X};\widehat{\theta}^{[q-1]}\},\qquad (20)$$
where $\mathbb{E}\{\,.\mid\mathbf{X};\widehat{\theta}^{[q-1]}\}$ denotes the expectation conditionally on X and with parameter values $\widehat{\theta}^{[q-1]}$. If the above conditional expectation cannot be computed, it can be replaced by a sample mean, that is (20) can be replaced by:
$$\widehat{\theta}^{[q]}=\frac{1}{K}\sum_{k=1}^{K}\widehat{\theta}(\mathbf{r}^{(k)},\mathbf{X}),\qquad (21)$$
where $K\in\mathbb{N}^{\ast}$ is fixed and each r ^{(k)} is drawn according to the a posteriori law $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\widehat{\theta}^{[q-1]})$. Note that if θ is vector-valued, (20) can be used for those components for which it can be computed, and (21) can be used otherwise.
Note that the two conditions required in order to apply ICE are very weak, which is the reason for our interest in ICE. Indeed, concerning the first one, there would be no hope of performing incomplete data estimation if no complete data estimator existed, whereas the second requirement consists only in being able to simulate random values according to the a posteriori law.
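The sample-mean form of the ICE iteration can be sketched on a toy problem. The toy model below (a scalar mixture of $\mathcal{N}(0,1)$ and $\mathcal{N}(4,1)$ with unknown weight η) and all function names are illustrative assumptions and are not the model considered in this article; the parameter is scalar, so the mean over the K draws is a plain average.

```python
import math
import random

def ice_iterate(theta0, X, complete_estimator, draw_r, n_iter, K=1):
    """Run n_iter ICE updates, each one a mean over K posterior draws of r.

    complete_estimator(r, X) -> scalar estimate from complete data.
    draw_r(X, theta)         -> one realization of r drawn from P(r | X; theta).
    Both callables are hypothetical placeholders supplied by the user.
    """
    theta = theta0
    history = [theta]
    for _ in range(n_iter):
        draws = [draw_r(X, theta) for _ in range(K)]
        theta = sum(complete_estimator(r, X) for r in draws) / K
        history.append(theta)
    return history

def normal_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def make_draw_r(rng):
    """Posterior sampler of r for the toy mixture, with eta = P(r(t)=0)."""
    def draw_r(X, eta):
        r = []
        for x in X:
            a = eta * normal_pdf(x, 0.0)
            b = (1.0 - eta) * normal_pdf(x, 4.0)
            r.append(0 if rng.random() < a / (a + b) else 1)
        return r
    return draw_r

def frequency_of_zeros(r, X):
    """Complete-data estimator of eta: empirical frequency of r(t)=0."""
    return sum(1 for v in r if v == 0) / len(r)
```

Calling `ice_iterate(0.5, X, frequency_of_zeros, make_draw_r(rng), n_iter=20)` returns the whole sequence of estimates, which makes it easy to inspect how quickly the iteration stabilizes.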
5.3 Applicability of ICE and assumed distributions
In this section, we give details about how the two conditions given in Section 5.2 for applicability of ICE are fulfilled.
First, as explained in Section 2, knowing r provides an easy way of estimating a separating matrix by considering as in (18) the subset X _{0} of the samples. A complete data estimator $\widehat{\theta}(\mathbf{r},\mathbf{X})$ hence exists.
To use ICE, one should additionally know the law $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\eta)$. Since X=A S, we have
and $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\eta)$ is identical to the law $\mathbb{P}(\mathbf{r}\mid\mathbf{S};\eta)$, where S=A ^{−1} X=B X. An expression for the latter law is available if a model is assumed for $\mathbb{P}(\mathbf{S},\mathbf{r};\eta)$. Importantly, the model used in the ICE estimation algorithm can be different from the model followed by the simulated data that are processed by the algorithm. This is crucial, because actual distributions are generally unknown in practical BSS problems. In particular, we here choose a model which follows Assumptions A3., A4., and A5., but which is different from the particular Assumptions A4.’, A5.’, and A5.” in Section 2. More precisely, translating Assumption A3. only, the joint distribution of (r,S) under the parameter value η reads:
$$\mathbb{P}(\mathbf{r},\mathbf{S};\eta)=\mathbb{P}(\mathbf{r};\eta)\prod_{t=1}^{T}\mathbb{P}(\mathbf{s}(t)\mid r(t);\eta).\qquad (23)$$
The above equation holds both for the assumed distribution and for the distribution of the simulated data. On the contrary, the expressions of $\mathbb{P}(\mathbf{r};\eta)$ and $\mathbb{P}(\mathbf{s}(t)\mid r(t);\eta)$ that are assumed in ICE differ from the ones of the simulated data and they are given in the next paragraphs.
5.3.1 Assumed $\mathbb{P}(\mathbf{s}(t)\mid r(t))$ in ICE
First, note that in the following, similarly to the real data distribution, $\mathbb{P}(\mathbf{s}(t)\mid r(t);\eta)=\mathbb{P}(\mathbf{s}(t)\mid r(t))$ does not depend on the parameters η to be estimated.
Experimentally, we observed that, for robustness reasons, the distribution $\mathbb{P}(\mathbf{s}(t)\mid r(t))$ assumed in ICE should not have a bounded support or be too specific. For this reason, we assumed the following distributions:

$\mathbb{P}(\mathbf{s}(t)\mid r(t)=0)\sim\mathcal{N}(0,1)$,

$\mathbb{P}(\mathbf{s}(t)\mid r(t)=1)$ is such that both components of u defined in (19) have the distribution $\frac{1}{2}\mathcal{L}(\lambda)+\frac{1}{2}\mathcal{N}(0,\sigma_\lambda^2)$.
A typical realization of the distribution of (s _{1}(t),s _{2}(t))^{T} assumed in ICE is given in Figure 3. This distribution is of course different from the ones of the simulated data in Figures 1 and 2. However, simulations will confirm that it is a reasonable choice.
Conditionally on r(t)=0, (s _{1}(t),s _{2}(t))^{T} is assumed to be an independent, zero-mean, unit-variance Gaussian vector: this is an uninformative distribution having a density with unbounded support. Conditionally on r(t)=1, each component of u in the assumed distribution is a mixture of a Laplace and a Gaussian law. Experimentally, and from the observation of Figures 1 and 3, the latter choice seems a relevant approximation of the data model from Example 1 in Section 2. The parameter λ should strike a compromise between a good fit to the data (λ large) and robustness of the method (λ small). Also, the assumed conditional distribution $\mathbb{P}(\mathbf{s}(t)\mid r(t)=1)$ is invariant with respect to any permutation and sign change. Such a symmetry is necessary in our method because ICA algorithms leave permutation ambiguities. For this reason, the same distribution has been considered to model the data source signals generated in Example 2 (Section 2).
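The two assumed conditional densities, and the invariance of the second one under permutation and sign change, can be illustrated by the following sketch. The 45-degree rotation from s to u, the Laplace parametrization with density $\frac{\lambda}{2}e^{-\lambda|u|}$ and the function names are assumptions of this sketch; it also requires λ>1 so that the Gaussian variance is positive.

```python
import math

SQRT2 = math.sqrt(2.0)

def gauss_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def laplace_pdf(x, lam):
    return 0.5 * lam * math.exp(-lam * abs(x))

def p0(s):
    """Assumed density of s(t) given r(t)=0: independent N(0,1) components."""
    return gauss_pdf(s[0], 1.0) * gauss_pdf(s[1], 1.0)

def p1(s, lam):
    """Assumed density of s(t) given r(t)=1 (requires lam > 1).

    Each component of u follows 1/2 Laplace(lam) + 1/2 N(0, sigma_lambda^2).
    The rotation from s to u is an ASSUMED convention; its Jacobian is 1.
    """
    var = 2.0 * (1.0 - 1.0 / lam ** 2)
    u = ((s[0] + s[1]) / SQRT2, (s[0] - s[1]) / SQRT2)
    dens = 1.0
    for ui in u:
        dens *= 0.5 * laplace_pdf(ui, lam) + 0.5 * gauss_pdf(ui, var)
    return dens
```

Because both components of u share the same even density, swapping s _{1} and s _{2} or flipping the sign of one of them only permutes or negates the components of u, which leaves `p1` unchanged.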
5.3.2 Assumed $\mathbb{P}(\mathbf{r};\eta)$ in ICE
We propose two different models for the latent process r.
I.i.d. latent process
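The simplest case is when r is an i.i.d. Bernoulli process, that is η is a scalar parameter in [0,1] and:
$$\mathbb{P}(\mathbf{r};\eta)=\prod_{t=1}^{T}\mathbb{P}(r(t);\eta),\qquad (24)$$
with for all t∈{1,…,T}:
$$\mathbb{P}(r(t)=0;\eta)=\eta,\qquad \mathbb{P}(r(t)=1;\eta)=1-\eta.\qquad (25)$$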
Markov latent process
We can alternatively consider that r is a stationary Markov chain, that is:
$$\mathbb{P}(\mathbf{r};\eta)=\mathbb{P}(r(1);\eta)\prod_{t=2}^{T}\mathbb{P}(r(t)\mid r(t-1);\eta),\qquad (26)$$
where $\mathbb{P}(r(t)\mid r(t-1))$ is given by a transition matrix independent of t. In this case, the parameters η consist of the different transition probabilities and the initial probabilities. The main advantage of considering a Markov model is that the posterior transitions $\mathbb{P}(r(t)\mid r(t-1),\mathbf{X};\eta)$ can be calculated by an efficient forward-backward algorithm [35]. A sampling of the hidden process according to the posterior law $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\eta)$ is hence possible, making the ICE method applicable [29].
6 Combined ICA/ICE separation algorithm
In this section, we first detail our separation algorithm which combines ICA and ICE. We then interpret ICE in terms of random variable generation.
6.1 Algorithm
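We denote by $\mathcal{ICA}$ one among all possible ICA algorithms [1] such as JADE [26], CoM2 [25], FastICA [27], etc. Given a set of observation samples X, the separating matrix estimated by the ICA algorithm is denoted by $\mathbf{B}=\mathcal{ICA}(\mathbf{X})$. With these notations, a complete data estimator of the separating matrix is provided by $\mathbf{B}=\mathcal{ICA}(\mathbf{X}_0)$, where X _{0} has been defined in Equation (18). Our algorithm consists in an ICE estimation of the parameters θ=(B,η). The parameters η which characterize r are estimated according to (20), whereas the separating matrix B is estimated according to (21) with K=1. Here is a summary of the algorithm:
Initialization: choose an initial guess of θ=(B,η).
At each iteration q=1,…,q _{max}, given the current estimate of θ:
Step 1: compute the posterior law of r given X and draw $\widehat{\mathbf{r}}^{[q]}$ according to it;
Step 2: select the subset $\widehat{\mathbf{X}}_0^{[q]}$ of the samples x(t) for which $\widehat{r}^{[q]}(t)=0$, and update the separating matrix as $\mathcal{ICA}(\widehat{\mathbf{X}}_0^{[q]})$;
Step 3: update η according to (20).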
Steps 1 and 3 of the above algorithm are further detailed in the following sections. Note that, if the parameters η are available as additional information, they need not be estimated and step 3 of the algorithm is not necessary.
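The three steps can be sketched as a loop for two sources and an i.i.d. latent process. This is only an illustrative skeleton: the ICA routine and the posterior of r(t) are passed as hypothetical callables, the guard on the size of the selected subset is an addition of this sketch, and none of the names come from the article.

```python
import random

def separate(B, X):
    """Apply the 2x2 matrix B to every observation x(t) in X."""
    return [(B[0][0] * x[0] + B[0][1] * x[1],
             B[1][0] * x[0] + B[1][1] * x[1]) for x in X]

def ica_ice(X, ica, posterior_r0, B_init, eta_init, q_max, rng):
    """Sketch of the combined ICA/ICE loop (i.i.d. latent process).

    ica(samples)         -> 2x2 separating matrix (any ICA routine).
    posterior_r0(s, eta) -> P(r(t)=0 | s(t); eta) under the assumed model.
    Both callables are hypothetical and must be supplied by the user.
    """
    B, eta, r_hat = B_init, eta_init, []
    for _ in range(q_max):
        S = separate(B, X)
        # Step 1: posterior of r given X, then a single draw of r (K=1).
        alphas = [posterior_r0(s, eta) for s in S]
        r_hat = [0 if rng.random() < a else 1 for a in alphas]
        # Step 2: ICA restricted to the samples labelled as independent.
        X0 = [x for x, r in zip(X, r_hat) if r == 0]
        if len(X0) >= 2:  # guard added in this sketch, not from the source
            B = ica(X0)
        # Step 3: eta becomes the mean posterior probability of r(t)=0.
        eta = sum(alphas) / len(alphas)
    return B, eta, r_hat
```

Step 3 reuses the posterior probabilities computed in Step 1, so both updates are based on the parameter values of the previous iteration.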
6.2 Details in the case of an i.i.d. latent process
We here detail the three steps of our algorithm in the case where r is i.i.d. As we will see, in this case, ICE is akin to generating a set of random samples that satisfy the usual assumptions of ICA with an accept-reject method.
6.2.1 Step 1
In the case where r is i.i.d., we have from (22), (23), and (24):
$$\mathbb{P}(\mathbf{r}\mid\mathbf{X};\eta)=\prod_{t=1}^{T}\mathbb{P}(r(t)\mid\mathbf{s}(t);\eta),\qquad \mathbb{P}(r(t)\mid\mathbf{s}(t);\eta)\propto\mathbb{P}(r(t);\eta)\,\mathbb{P}(\mathbf{s}(t)\mid r(t)),$$
where $\mathbb{P}(r(t);\eta)$ has been given in Equation (25) and $\mathbb{P}(\mathbf{s}(t)\mid r(t))$ has been described in Section 5.3.1. Writing $\alpha_t=\mathbb{P}(r(t)=0\mid\mathbf{X};\eta)$ (and hence $1-\alpha_t=\mathbb{P}(r(t)=1\mid\mathbf{X};\eta)$), it follows from the above equation that in step 1, all samples of $\widehat{\mathbf{r}}^{[q]}$ are independent and such that
$$\mathbb{P}(\widehat{r}^{[q]}(t)=0)=\alpha_t,\qquad \mathbb{P}(\widehat{r}^{[q]}(t)=1)=1-\alpha_t.$$
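In the i.i.d. case, Step 1 reduces to a sample-by-sample application of Bayes' rule followed by independent Bernoulli draws. A minimal sketch, with illustrative function names, is given below; the density values are assumed to be evaluated beforehand at the current source estimate s(t)=B x(t).

```python
import random

def posterior_r0(p0_s, p1_s, eta):
    """Return alpha_t = P(r(t)=0 | s(t); eta) by Bayes' rule.

    p0_s and p1_s are the assumed densities of s(t) given r(t)=0 and
    r(t)=1, and eta = P(r(t)=0).
    """
    num = eta * p0_s
    den = num + (1.0 - eta) * p1_s
    return num / den

def draw_r_hat(alphas, rng):
    """Draw independent labels: r_hat(t)=0 with probability alpha_t."""
    return [0 if rng.random() < a else 1 for a in alphas]
```

When both densities take the same value at s(t), the posterior probability equals the prior η, so the observation carries no information on r(t).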
6.2.2 Step 2: accept-reject random variable generation
One can see that, when selecting $\widehat{\mathbf{X}}_0^{[q]}$ in the second step of our algorithm, some samples are kept, others are thrown away. A close parallel can be drawn with random variable generation by the acceptance-rejection method.
For clarity, let us define $\mathbb{P}_0(\mathbf{s}(t))=\mathbb{P}(\mathbf{s}(t)\mid r(t)=0)$ and $\mathbb{P}_1(\mathbf{s}(t))=\mathbb{P}(\mathbf{s}(t)\mid r(t)=1)$. At time t, according also to Equations (25) and (23), the distribution of s(t) is given by:
$$\mathbb{P}(\mathbf{s}(t))=\eta\,\mathbb{P}_0(\mathbf{s}(t))+(1-\eta)\,\mathbb{P}_1(\mathbf{s}(t)).$$
We have the following lemma:
Lemma 4
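Let s be a random variable (or random vector) with probability distribution given by
$$\mathbb{P}(\mathbf{s})=\eta\,\mathbb{P}_0(\mathbf{s})+(1-\eta)\,\mathbb{P}_1(\mathbf{s}).$$
Let $\widehat{r}\in\{0,1\}$ be a binary random variable with probability distribution such that:
$$\mathbb{P}(\widehat{r}=0\mid\mathbf{s})=\frac{\eta\,\mathbb{P}_0(\mathbf{s})}{\eta\,\mathbb{P}_0(\mathbf{s})+(1-\eta)\,\mathbb{P}_1(\mathbf{s})},\qquad \mathbb{P}(\widehat{r}=1\mid\mathbf{s})=\frac{(1-\eta)\,\mathbb{P}_1(\mathbf{s})}{\eta\,\mathbb{P}_0(\mathbf{s})+(1-\eta)\,\mathbb{P}_1(\mathbf{s})}.$$
Then, the conditional distribution $\mathbb{P}(\mathbf{s}\mid\widehat{r}=0)$ is $\mathbb{P}_0(\mathbf{s})$.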
Proof
One can write the joint probability distribution $\mathbb{P}(\mathbf{s},\widehat{r})=\eta\,\delta(\widehat{r}=0)\,\mathbb{P}_0(\mathbf{s})+(1-\eta)\,\delta(\widehat{r}=1)\,\mathbb{P}_1(\mathbf{s})$ and the result follows by conditioning. □
Lemma 4 relates our algorithm to the accept-reject algorithm for random variable generation. Indeed, drawing $\widehat{r}$ and conditioning on $\widehat{r}=0$ to get $\mathbb{P}(\mathbf{s}\mid\widehat{r}=0)$ is performed in our algorithm by drawing $\widehat{r}^{[q]}(t)$ and keeping only those samples s(t) of S where $\widehat{r}^{[q]}(t)=0$, that is, it amounts to rejecting those samples where $\widehat{r}^{[q]}(t)=1$. In other terms, through the ICE algorithm, the target distribution $\mathbb{P}_0(\mathbf{s})$ is generated using the instrumental distribution $g(\mathbf{s})=\eta\,\mathbb{P}_0(\mathbf{s})+(1-\eta)\,\mathbb{P}_1(\mathbf{s})$. Samples from this instrumental distribution g(s) are given by the data themselves. By doing so, we obtain a set of data following the distribution $\mathbb{P}_0(\mathbf{s})$. Since $\mathbb{P}_0(\mathbf{s}(t))=\mathbb{P}(\mathbf{s}(t)\mid r(t)=0)$ and according to Assumption A4., this is precisely the distribution under which the ICA algorithm is applicable.
Remark 3
It is known that the probability distribution of s in Lemma 4 can be seen as the marginal of (r,s), where r is a Bernoulli process and the conditional laws of s knowing r are given by ${\mathbb{P}}_{0}$ and ${\mathbb{P}}_{1}$. However, $\widehat{r}$ is drawn independently of r and in particular, $\widehat{r}$ is different from r. It means that in our algorithm, the original latent process r and the ICE sampling ${\widehat{\mathbf{r}}}^{\left[q\right]}$ may be quite different, although the selected samples are distributed following ${\mathbb{P}}_{0}$. This will be illustrated in Section 2.
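Lemma 4 and Remark 3 can be illustrated numerically on a scalar toy example. The choice $\mathbb{P}_0=\mathcal{N}(0,1)$ and $\mathbb{P}_1=\mathcal{N}(0,3^2)$ is an assumption made only for this sketch; the label $\widehat{r}$ is drawn from its conditional law given s without ever using the true label r.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def lemma4_demo(eta, n, rng):
    """Draw s from eta*P0 + (1-eta)*P1, then r_hat from its law given s.

    Toy choice (assumption): P0 = N(0,1) and P1 = N(0, 3^2).
    Returns the kept samples (r_hat = 0) and the rate of agreement
    between r_hat and the true latent label r.
    """
    kept, agree = [], 0
    for _ in range(n):
        r = 0 if rng.random() < eta else 1
        s = rng.gauss(0.0, 1.0 if r == 0 else 3.0)
        a = eta * normal_pdf(s, 0.0, 1.0)
        b = (1.0 - eta) * normal_pdf(s, 0.0, 3.0)
        r_hat = 0 if rng.random() < a / (a + b) else 1  # does not use r
        if r_hat == 0:
            kept.append(s)
        agree += (r_hat == r)
    return kept, agree / n
```

On such overlapping components the agreement rate stays far from 1, whereas the empirical variance of the kept samples is close to 1, the variance of $\mathbb{P}_0$.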
6.2.3 Step 3
A complete data estimator of the parameter η is given by the empirical frequency $\widehat{\eta}=\frac{1}{T}\sum_{t=1}^{T}\delta(r(t)=0)$. Equation (20) then yields the following update rule for the parameter η:
$$\widehat{\eta}^{[q]}=\frac{1}{T}\sum_{t=1}^{T}\mathbb{P}(r(t)=0\mid\mathbf{X};\widehat{\theta}^{[q-1]}).\qquad (28)$$
6.3 Markov latent process
We here detail the three steps of our algorithm in the case where r is a Markov process.
6.3.1 Step 1
Here, we assume in the ICE part of the procedure that r is a Markov process. Then, the posterior probability $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\widehat{\theta}^{[q]})$ is Markov, and its transitions can be calculated according to the forward-backward or Baum-Welch algorithm. At each step q of the algorithm, $\widehat{\mathbf{r}}^{[q]}$ is then generated according to $\mathbb{P}(\mathbf{r}\mid\mathbf{X};\widehat{\theta}^{[q]})$ as a non-stationary Markov chain. It is beyond the scope of this article to further detail these well-known procedures, and the reader is referred to [29, 30, 35] and references therein for more explanations.
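One standard way of drawing a hidden Markov chain from its posterior law is forward filtering followed by backward sampling. The sketch below uses this variant, which is not necessarily the exact procedure of [29, 30, 35]; the argument layout and function name are assumptions.

```python
import random

def sample_posterior_chain(lik, trans, init, rng):
    """Draw r from P(r | X) by forward filtering and backward sampling.

    lik[t][i]   : assumed density of s(t) given r(t)=i,
    trans[i][j] : P(r(t+1)=j | r(t)=i),
    init[i]     : P(r(1)=i).
    """
    T, n = len(lik), len(init)
    # Forward pass: filt[t][i] = P(r(t)=i | s(1..t)), normalized per step.
    filt = []
    for t in range(T):
        if t == 0:
            pred = list(init)
        else:
            pred = [sum(filt[t - 1][i] * trans[i][j] for i in range(n))
                    for j in range(n)]
        w = [pred[j] * lik[t][j] for j in range(n)]
        total = sum(w)
        filt.append([v / total for v in w])
    # Backward pass: draw the last state, then each r(t) given r(t+1).
    r = [0] * T
    r[T - 1] = rng.choices(range(n), weights=filt[T - 1])[0]
    for t in range(T - 2, -1, -1):
        w = [filt[t][i] * trans[i][r[t + 1]] for i in range(n)]
        r[t] = rng.choices(range(n), weights=w)[0]
    return r
```

The backward weights depend on t through the filtered probabilities, which is why the sampled chain is non-stationary even when the prior chain is stationary.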
6.3.2 Step 2
It is unchanged, although the interpretation as an accept-reject random variable generation does not hold any longer.
6.3.3 Step 3
From the values of $\mathbb{P}(r(t)\mid\mathbf{X},\widehat{\theta}^{[q]})$ and $\mathbb{P}(r(t),r(t+1)\mid\mathbf{X},\widehat{\theta}^{[q]})$, formulas similar to Equation (28) yield the estimated probabilities for r(t) and the pairs (r(t),r(t+1)). The transition probabilities are then obtained by the relation $\mathbb{P}(r(t+1)\mid r(t),\mathbf{X},\widehat{\theta}^{[q]})=\frac{\mathbb{P}(r(t),r(t+1)\mid\mathbf{X},\widehat{\theta}^{[q]})}{\mathbb{P}(r(t)\mid\mathbf{X},\widehat{\theta}^{[q]})}$. The reader is again referred to [29, 30, 35] for more details.
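The re-estimation of a stationary transition matrix from posterior pair probabilities can be sketched as follows. The time averaging mirrors the update of η; computing the marginal by summing the averaged pair probabilities over the second state is a simplification chosen for this sketch, and the function name is illustrative.

```python
def estimate_transitions(pair_post):
    """Estimate a transition matrix from posterior pair probabilities.

    pair_post[t][i][j] = P(r(t)=i, r(t+1)=j | X) for the T-1 consecutive
    pairs, as delivered for instance by a forward-backward pass.
    """
    n = len(pair_post[0])
    m = len(pair_post)
    # Time-averaged probabilities of the pairs (r(t), r(t+1)).
    pair = [[sum(p[i][j] for p in pair_post) / m for j in range(n)]
            for i in range(n)]
    # Marginal of the first state, then ratio pair / marginal.
    marg = [sum(pair[i]) for i in range(n)]
    return [[pair[i][j] / marg[i] for j in range(n)] for i in range(n)]
```

Each row of the returned matrix sums to one by construction.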
7 Simulations
We performed several simulations for different numbers of samples, respectively, T=1000, T=5000, T=10000. We have set q _{max}=30 in our algorithm: this value has been determined empirically by testing values of q _{max} up to 150. For values greater than q _{max}=30, no significant quality improvement has been observed and this choice hence seems satisfactory as far as ICE convergence is concerned. Although highly interesting, a more detailed analysis of the convergence rate of our algorithm is beyond the scope of the article. The separation quality criterion considered is the mean square error (MSE): the provided value is a mean of the MSE over all sources. Additionally, we considered the empirical segmentation rate provided by the last sampling $\widehat{\mathbf{r}}^{[q_{\max}]}$ in Step 1 of the ICE procedure. We define this segmentation rate as follows:
$$\Upsilon_{\mathrm{ICE}}=\frac{1}{T}\sum_{t=1}^{T}\delta\left(\widehat{r}^{[q_{\max}]}(t)=r(t)\right).$$
This corresponds to the proportion of time samples which are correctly classified as corresponding either to independent (r(t)=0) sources or to dependent (r(t)=1) sources. In all cases, the source signals and the mixing matrix have been randomly generated.
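The segmentation rate is a plain proportion of matching labels, which can be computed as in the following sketch (the function name is illustrative):

```python
def segmentation_rate(r_hat, r):
    """Proportion of time samples t for which r_hat(t) equals r(t)."""
    if len(r_hat) != len(r) or not r:
        raise ValueError("both label sequences must share a positive length")
    return sum(1 for a, b in zip(r_hat, r) if a == b) / len(r)
```

A value close to 0.5 for a balanced binary process indicates a labelling that is no better than chance.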
The classical algorithm used in our simulations, which has been denoted ICA in Section 2, was CoM2 [25]. For comparison, we considered as a worst-case reference the result obtained when CoM2 is applied to the data, simply ignoring that the ICA assumptions are violated at some sample times. On the other hand, as a best-case reference, we considered the complete data situation, that is when r is available and the separation is performed based on X _{0}.
In our simulations, we generated data according to the following models:

sources satisfying Example 1 in Section 2. We chose the process a(t) as a mixture of Gaussians as defined in Equation (15), with μ=0.5. According to Section 2, the CoM2 algorithm, if applied on the dependent data only, is not successful in separating the sources.

sources satisfying Example 2 in Section 2. In this case, we choose $\lambda =\sqrt{2}$.
7.1 I.i.d. latent process with known η
We first provide some simulations in the case where r is an i.i.d. Bernoulli process with known η parameter (see Equation (24)).
7.1.1 Example 1
Separation results for sources generated according to Example 1 with r i.i.d. are gathered in Figures 4 and 5. Values in Figure 4 are average values over 1000 Monte Carlo realizations whereas values in Figure 5 are the individual results for 100 Monte Carlo realizations.
In Figure 4, the MSE values have been plotted depending on η∈[0,1] for T=1000 (a), T=5000 (b), T=10000 (c) samples. Note that for η=0, all samples of the data sources are dependent and the complete data separation is therefore not applicable. Similarly, when setting η to zero in the ICE algorithm, all samples are necessarily classified as dependent and our method is not applicable. For this reason, when η=0 has been used in the simulated sources we have set η to 0.1 in the algorithm.
Naturally, the complete data case provides the best results: when η tends to zero, the number of samples in X _{0} decreases, which explains the increasing MSE for smaller η values. On the other hand, ignoring the dependence provides results that are unacceptable for η smaller than 0.7, approximately. It seems however that CoM2 is able to separate the particular sources considered for η greater than 0.7, approximately: from the results in Section 2, the existence of such a limit value above which CoM2 is successful seems quite natural. For η smaller than 0.7 the proposed method provides a very significant improvement in terms of separation quality. The classification rate Υ _{ICE} has also been plotted in Figure 4 and one can observe that samples of dependent/independent sources are quite well classified, which corroborates the good MSE values obtained with our method.
Depending on the parameter λ, the source distribution assumed in the ICE estimation and described in Section 5.3.1 is closer to or farther from the real source distribution of Example 1, which has part of its samples lying on the two bisectors (see the comparison between Figure 1a and Figure 3a,b). In Figure 5, we tested the influence of the parameter λ for a fixed value of η=0.5. The MSE values of 100 Monte Carlo realizations have been plotted for T=1000 and T=10000 samples. The corresponding segmentation rate Υ _{ICE} has been plotted in the lower part of the figures. First, it can be seen in Figure 5 that the separation results of the Monte Carlo realizations can be clearly separated in two groups:

for a minority of cases, the segmentation rate Υ _{ICE} is close to 0.5, meaning that the procedure completely fails to classify the latent process r. In such a case, the corresponding MSE is large (around 0.5) and separation is unsuccessful. This situation occurs more often for large values of λ: a greater value of λ consequently implies less robustness of our method.

for most of the realizations, Υ _{ICE} is significantly greater than 0.5, indicating that the procedure succeeds in classifying approximately the latent process r. In such a case, the corresponding MSE is low and the separation is successful. One can see that, in such a case where the separation is successful, a greater value of λ comes with a greater Υ _{ICE}, hence a better segmentation of r and a lower MSE, hence a better separation quality.
In conclusion, the choice of λ should be a compromise between good separation quality (in case of success) and robustness in order to limit the rate of separation failure.
7.1.2 Example 2
The results for source signals following Example 2 (see Section 2 with $\lambda =\sqrt{2}$) are given in Figure 6. Similarly to the previous section, for T=1000 (a), T=5000 (b), T=10000 (c) samples and depending on η∈[0,1] the MSE values have been plotted in the top graph and the segmentation rate Υ _{ICE} in the bottom graph.
As expected, the complete data estimation based on X _{0} of the separating matrix provides again the best results. In comparison with the case where the dependence of the sources is ignored, our method provides a very significant improvement, in particular for η lying between 0.4 and 0.8, approximately. Very interestingly, as witnessed in Figure 6, the good performance in terms of separation and MSE is obtained with very poor performance in terms of classification of the latent process r: note in particular that for η=0.5, we have a low MSE whereas Υ _{ICE} is close to 0.5. The poor classification can be easily understood when comparing Figure 2b with Figure 1a. The important point in this observation is that a good classification of the latent process r is a sufficient but not a necessary condition for good ICA estimation. This is in contrast with the former example but is fully justified by the interpretation of Step 2 of the algorithm in Section 2. Indeed, according to Lemma 4, the ICE part of our algorithm selects a set of samples ${\widehat{\mathbf{X}}}_{0}$ which is such that its distribution satisfies the usual ICA assumptions, although the classification may well be very poor.
7.2 I.i.d. latent process with unknown η
We now consider the case where r is i.i.d. but the parameter η in Equation (25) is unknown. In this case, η is estimated by the ICE algorithm. The results have been averaged over 1000 Monte Carlo realizations and are gathered in Table 5 for both sources from Example 1 and Example 2. In the case of Example 1, we have set the parameter λ=5 in the assumed distribution of Section 5.3.1, whereas we have set $\lambda=\sqrt{2}$ in the case of Example 2. For comparison, we provide the MSE results when η is known and when the algorithm CoM2 is applied, just ignoring that the generated source signals are dependent.
One can see from the values in Table 5 that our separation method remains valid even in the case where η is estimated.
7.3 Markov latent process
As previously indicated, the ICE estimation algorithm is able to take into account the Markov dependence of r. We have considered the importance of modeling r as a Markov or i.i.d. process.
The process r has been generated as a stationary Markov chain with transition matrix $\begin{pmatrix}0.9 & 0.1\\ 0.1 & 0.9\end{pmatrix}$, in which case $\mathbb{P}(r(t)=0)=\mathbb{P}(r(t)=1)=\frac{1}{2}$. On the corresponding data set with r Markov, our separation method has been performed both in the case where r is modeled as i.i.d. and in the case where r is modeled as a Markov chain. In both cases, the parameters in η have been estimated from the incomplete data. Figure 7 provides the results for sources from Example 1 (with λ=15 in the algorithm) and Figure 8 provides the results for sources from Example 2 (with $\lambda=\sqrt{2}$ in the algorithm). The separation results with complete data and with CoM2 ignoring the dependence of the sources have also been plotted and, as expected, they correspond respectively to the best and worst MSE values. On the top plots of Figure 7, one can see that, for sources from Example 1, the performance is improved when taking into account the Markovianity of r. More precisely, similarly to the simulations in Section 2 with λ=15, the MSE values show a successful separation for a majority of realizations: in such a case, the Markovianity assumption clearly improves the MSE. However, the Markovianity property does not significantly improve the robustness of the method, since the number of realizations where separation has failed is approximately identical with a Markov and an i.i.d. model.
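The stationary law of the transition matrix used in these simulations can be checked by power iteration. The sketch below assumes a row-stochastic matrix whose chain is irreducible and aperiodic, which holds for the matrix above; the names are illustrative.

```python
def stationary_distribution(trans, n_iter=200):
    """Iterate pi <- pi * trans, starting from a point mass on state 0."""
    n = len(trans)
    pi = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        pi = [sum(pi[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return pi

# Transition matrix used to generate r in this section.
TRANS = [[0.9, 0.1], [0.1, 0.9]]
```

Starting from a point mass rather than from the uniform law makes the convergence towards equal probabilities for both states visible.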
Interestingly, a different behavior can be seen in Figure 8 with data generated according to Example 2. Indeed, the performance has not improved when r has been modeled as a Markov chain. Intuitively, this may come from the fact that the two components of the probability mixture in the source distribution are harder to distinguish, as it has been illustrated in Figure 2.
8 Conclusion
In this article, we have first studied the behavior of kurtosis-based contrast functions in specific cases of dependent sources. Observing that the criteria are polynomial functions of the parameters, we have been able to explicitly give the theoretical maxima as a function of a kurtosis value parameter ${\kappa}_{4}^{\left(a\right)}$. The behavior of the classical kurtosis contrast function can thus be understood depending on ${\kappa}_{4}^{\left(a\right)}$ and its restricted validity has been proved.
We have then introduced a model of dependent sources which consists in a probability mixture of two distributions. One component of the mixture satisfies the ICA assumption and, using an ICE estimation method, we have been able to exploit this information in order to perform separation. This example suggests that many more dependent sources might be separated if an adequate model of their distribution is provided. More generally, since the distribution model as a probability mixture may be an approximate one, an interesting problem would be to know to what extent a given distribution may be approximated by our proposed model in order to perform BSS. Finally, simulations have confirmed and validated our theoretical results.
Endnotes
^{a} It rectifies an error in [22]. ^{b} One can show that the number of solutions of (10) is indeed finite. Note that one can also solve the polynomial equations, as has been done in the proof of Proposition 1.
References
 1.
Comon P, Jutten (eds) C: Handbook of Blind Source Separation, Independent Component Analysis and Applications. Oxford: Academic Press; 2010.
 2.
Li TH: Finitealphabet information and multivariate blind deconvolution and identification of linear systems. IEEE Trans. Inf. Theory 2003, 49: 330337. 10.1109/TIT.2002.806138
 3.
Comon P: Contrasts, independent component analysis, and blind deconvolution. Int. J. Adapt. Control Sig. Proc 2004, 18(3):225243. [hal00542916] 10.1002/acs.791
 4.
Hyvärinen A, Shimizu S: A quasistochastic gradient algorithm for variancedependent component analysis. In Proc. International Conference on Artificial Neural Networks (ICANN). Greece: Athens; 2006:211220.
 5.
Cardoso JF: Multidimensional independent component analysis. In Proc. ICASSP. Seattle; 1998:19411944.
 6.
Quirós A, Wilson SP: Dependent Gaussian mixture models for source separation. In European Signal Processing Conference (EUSIPCO). Spain: Barcelona; 2011:17231727.
 7.
Almeida M, Vigário R, Dias JB: Phase locked matrix factorization. In European Signal Processing Conference (EUSIPCO). Spain: Barcelona; 2011:17281732.
 8.
Gutch HW, Krumsiek J, Theis FJ: An ISA algorithm with unkonwn group sizes identifies meaningful clusters in metabolomics data. In European Signal Processing Conference (EUSIPCO). Spain: Barcelona; 2011:17331737.
 9.
Nordhausen K, Oja H: Scatter matrices with independent block property and ISA. In European Signal Processing Conference (EUSIPCO). Spain: Barcelona; 2011:17381742.
 10.
Rafi S, Castella M, Pieczynski W: An extension of the ICA model using latent variable. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Czech Republic: Prague; 2011:37123715.
 11.
Bach FR, Jordan MI: Treedependent component analysis. In Uncertainty in Artificial Intelligence (UAI): Proceedings of the Eighteenth Conference. San Francisco; 2002:3644.
 12.
Hyvärinen A, Hoyer PO, Inki M: Topographic independent component analysis. Neural Comput 2001, 13(7):15271558. 10.1162/089976601750264992
 13.
Caiafa CF, Kuruoglu EE, Proto AN: A minimax entropy method for blind separation of dependent components in astrophysical images. International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (Maxent) 2006.
 14.
Kuruoglu EE: Dependent component analysis for cosmology: a case study. In Proc. of LVA/ICA, Volume 6365 of LNCS. France: StMalo; 2010:538545.
 15.
Kim T, Lee I, Lee TW: Independent vector analysis: definition and algorithms. In Fortieth Asilomar Conference on Signal, Systems and Computers (ACSSC). CA: Pacific Grove; 2006:13931396.
 16.
Kohl F, Wübbeler G, Kolossa D, Elster C, Bär M, Orglmeister R: Nonindependent BSS: a model for evoked MEG signals with controllable dependencies. In Proc. of ICA’09, Volume 5441 of LNCS. Brazil: ParatyRJ; 2009:443450.
 17.
Akaho S: Conditionally independent component analysis for supervised feature extraction. Neurocomputing 2002, 49: 139150. 10.1016/S09252312(02)005180
 18.
Loubaton P, Regalia P: Blind deconvolution of multivariate signals: a deflation approach. In Proceedings of ICC. Switzerland: Geneva; 1993:11601164.
 19.
Delfosse N, Loubaton P: Adaptive blind separation of independent sources: a deflation approach. Signal Process 1995, 45: 5983. 10.1016/01651684(95)00042C
 20.
Comon P: Blind identification and source separation in 2×3 underdetermined mixtures. IEEE Trans. Signal Process 2004, 52: 1122. 10.1109/TSP.2003.820073
 21.
Comon P, Grellier O: Nonlinear inversion of underdetermined mixtures. In Proc. of ICA’99. France: Aussois; 1999:461465.
 22.
Castella M, Comon P: Blind separation of instantaneous mixtures of dependent sources. In Proc. of ICA’07, Volume 4666 of LNCS. UK: London; 2007:916.
 23.
Castella M: Inversion of polynomial systems and separation of nonlinear mixtures of finitealphabet sources. IEEE Trans. Signal Process 2008, 56(8, Part 2):39053917.
 24.
Pieczynski W: Statistical image segmentation. Mach. Graph. Vis 1992, 1(1/2):262268.
 25.
Comon P: Independent component analysis, a new concept. Signal Process 1994, 36(3):287314. 10.1016/01651684(94)900299
 26.
Cardoso JF, Souloumiac A: Blind beamforming for non gaussian signals. IEE ProceedingsF 1993, 140(6):362370.
 27.
Hyvärinen A, Oja E: Independent component analysis: algorithms and applications. Neural Netw 2000, 13(4–5):411430.
 28.
Lossen C: Singular: a computer algebra system. Comput. Sci. Eng 2003, 5(4):4555. 10.1109/MCISE.2003.1208641
 29.
Lanchantin P, Lapuyade-Lahorgue J, Pieczynski W: Unsupervised segmentation of randomly switching data hidden with non-Gaussian correlated noise. Signal Process 2011, 91: 163–175. 10.1016/j.sigpro.2010.05.033
 30.
Derrode S, Pieczynski W: Signal and image segmentations using pairwise Markov chains. IEEE Trans. Signal Process 2004, 52(9):2477–2489. 10.1109/TSP.2004.832015
 31.
Pieczynski W: EM and ICE in hidden and triplet Markov models. In Stochastic Modeling Techniques and Data Analysis international conference (SMTDA). Greece: Chania; 20102010.
 32.
Peng A, Pieczynski W: Adaptive mixture estimation and unsupervised local Bayesian image segmentation. Graphic. Models Image Process 1995, 57(5):389–399. 10.1006/gmip.1995.1033
 33.
Delmas J: An equivalence of the EM and ICE algorithm for exponential family. IEEE Trans. Signal Process 1997, 45(10):2613–2615. 10.1109/78.640732
 34.
Pieczynski W: Sur la convergence de l’estimation conditionnelle itérative. Comptes rendus de l’Académie des Sciences Mathématiques 2008, 246(7–8):457–460.
 35.
Devijver PA: Baum’s forward-backward algorithm revisited. Pattern Recogn. Lett 1985, 3: 369–373. 10.1016/0167-8655(85)90023-6
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Castella, M., Rafi, S., Comon, P. et al. Separation of instantaneous mixtures of a particular set of dependent sources using classical ICA methods. EURASIP J. Adv. Signal Process. 2013, 62 (2013). https://doi.org/10.1186/1687-6180-2013-62
Keywords
 Blind source separation
 Dependent sources
 Independent Component Analysis (ICA)
 Higher order statistics
 Iterative Conditional Estimation (ICE)