Here, a method is proposed for separating the source of interest *s* from *M* mixed signals **x**. The goal is to estimate *B* such that *y*=*B* **x** is as similar as possible to the original source *s*. Although *s* is unknown, we have the visual stream \mathcal{V} corresponding to it, so we can estimate *B* such that \widehat{\mathcal{V}} (the estimated visual stream corresponding to *y*) is as close as possible to \mathcal{V}.

A problem with the objective function (6) of [12] and [14] is that it does not efficiently model the non-linear AV relation (as discussed later in this section). It also imposes an independence (i.i.d.) assumption when modeling the relation between consecutive AV frames. We propose to improve the AV criterion through more realistic assumptions.

Consider the batch-wise separation problem of equation (2), where every batch *τ* consists of *T* frames. Ideally, the degree of AV coherency would be modeled and measured on the whole joint sequences of audio {\mathcal{S}}_{\tau}(1:T) and visual {\mathcal{V}}_{\tau}(1:T) frames, considering the true dependency among the variables. Let {\mathcal{M}}_{IDL}({\mathcal{S}}_{\tau},{\mathcal{V}}_{\tau}) be such an ideal model, which measures the degree of incoherency between the AV streams. Then, the de-mixing vector *B*_{τ} may be estimated by minimizing the ideal AV criterion {J}_{avIDL}(B;{\mathbf{x}}_{\tau},{\mathcal{V}}_{\tau})={\mathcal{M}}_{IDL}({\mathcal{Y}}_{\tau},{\mathcal{V}}_{\tau}).

However, training such an ideal model is not practical, due both to the large amount of AV training data required and to its training and optimization complexity. Hence, relaxation assumptions that factorize the model into a combination of reusable factor(s) are inevitable. The independent and identically distributed (i.i.d.) assumption underlying the GMM model of (6) is not a suitable assumption for modeling speech AV streams. We therefore propose an enhanced model with a weaker independence assumption. Instead of assuming absolute independence between AV frames, we assume conditional independence: the coherency of an AV frame can be estimated independently of the other frames given a context of a few (*K*) neighboring frames.

Extending {p}_{av}(\mathcal{S},\mathcal{V}) to model the joint probability density function (PDF) of *K* consecutive AV frames is not efficient. GMM and Gaussian distributions with full covariance matrices are not suitable for modeling high-dimensional random vectors, since the number of free parameters of these models is of order *O*(*d*^{2}) in the dimension *d* of the input random vectors. Increasing the input dimension by concatenating *K* AV frames would result in a very complex model with a huge number of free parameters that are not used effectively.

We propose to use an MLP instead of a GMM, and the mean square error (MSE) criterion instead of negative log probability, as the incoherency measure in an enhanced AV criterion. The number of free parameters of an MLP with narrow hidden layer(s) is of order *O*(*d*_{i}+*d*_{o}) in the dimensions *d*_{i} and *d*_{o} of its input and output. Moreover, thanks to its hierarchical structure (compared with the shallow and wide structure of a GMM), an MLP makes efficient use of its free parameters in learning the non-linear AV relation. An MLP, like a GMM, is differentiable with respect to its input. Hence, an objective function defined on an MLP can be optimized with fast convergence using derivative-based algorithms.
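The parameter-count argument above can be made concrete with a small sketch. The layer sizes, feature dimensions, and mixture count below are illustrative assumptions, not values from the paper:

```python
# Free-parameter counts for the two AV model families discussed above
# (a rough sketch; exact figures depend on implementation details).
def gmm_params(d, n_components):
    """Full-covariance GMM over d-dim vectors: each component has a mean (d),
    a symmetric covariance (d*(d+1)/2 free entries), and a mixture weight."""
    return n_components * (d + d * (d + 1) // 2 + 1)

def mlp_params(d_in, d_out, n_hidden):
    """One narrow hidden layer: weights and biases of both layers."""
    return n_hidden * (d_in + 1) + d_out * (n_hidden + 1)

# Example: K = 5 audio frames of 20 coefficients plus a 4-dim visual frame.
d_av = 5 * 20 + 4
print(gmm_params(d_av, 32))        # grows like O(d^2) in the joint dimension
print(mlp_params(5 * 20, 4, 30))   # grows like O(d_i + d_o)
```

With these (assumed) sizes the full-covariance GMM needs tens of thousands of parameters more than the narrow MLP, which is the point made in the text.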

### 3.1 MLP audio visual model

Having acoustic and visual streams of feature frames \mathcal{S} and \mathcal{V} extracted from pairs of corresponding AV signals *s* and **V** (see Section 4.1), a context-dependent AV associator can be trained using a suitable non-linear function approximator: \widehat{\mathcal{V}}(k)=h({\mathcal{S}}_{e}(k)), where {\mathcal{S}}_{e}(k)=E(\mathcal{S}(k-K/2-1:k+K/2)) is an embedded vector obtained from a context of *K* audio frames around frame *k*. One option for the embedding *E* is to stack the center frame of the context together with the first-order temporal differences of the other frames. In this paper, we adopt an MLP with *K* input audio frames, one hidden layer of *N*_{H} neurons, and a single visual frame as output, to approximate the AV mapping *h*(.). The MLP-based AV incoherency model {\mathcal{M}}_{MLP} is then defined as

{\mathcal{M}}_{MLP}({\mathcal{Y}}_{e}(k),\mathcal{V}(k))={\parallel \mathcal{V}(k)-h({\mathcal{Y}}_{e}(k))\parallel}^{2}

(7)
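The incoherency measure (7) can be sketched directly. The layer sizes, the tanh non-linearity, and the random weights below are illustrative assumptions standing in for a trained model:

```python
import numpy as np

# Minimal sketch of the MLP incoherency measure of (7). Layer sizes and the
# tanh activation are assumptions, not the paper's exact configuration.
rng = np.random.default_rng(0)
d_in, n_hidden, d_out = 100, 30, 4          # K stacked audio frames -> one visual frame
W1, b1 = rng.standard_normal((n_hidden, d_in)) * 0.1, np.zeros(n_hidden)
W2, b2 = rng.standard_normal((d_out, n_hidden)) * 0.1, np.zeros(d_out)

def h(ye):
    """MLP approximation of the audio-to-visual mapping h(.)."""
    return W2 @ np.tanh(W1 @ ye + b1) + b2

def m_mlp(ye, v):
    """Incoherency of one AV frame: squared error between the true visual
    frame v and the visual frame predicted from the audio context ye."""
    return float(np.sum((v - h(ye)) ** 2))

ye, v = rng.standard_normal(d_in), rng.standard_normal(d_out)
print(m_mlp(ye, v))   # a non-negative incoherency score
```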

### 3.2 Audio visual source separation algorithm

To compensate for the phoneme-viseme ambiguity, the MLP model must be used in a batch-wise manner. Hence, as in (6), the AV criterion is strengthened by integrating the incoherency scores of the *T* frames in each batch *τ*:

{J}_{avMLP}(B;{\mathbf{x}}_{\tau},{\mathcal{V}}_{\tau})=\sum _{k=1}^{T}{\mathcal{M}}_{MLP}({\mathcal{Y}}_{{e}_{\tau}}(k),{\mathcal{V}}_{\tau}(k))

(8)

Besides the difference between negative log probability and mean square error, another difference between the AV objective functions (6) and (8) is the form of the independence assumption used in measuring incoherency. The former assumes absolute independence (i.e., i.i.d.) between the frames, while the latter assumes conditional independence.

For each batch *τ* of mixed signals, and given a visual stream {\mathcal{V}}_{\tau} corresponding to one of the speech sources, the goal of separation is to find the de-mixing vector *B*_{τ}. As in (3), this can be achieved by minimizing the AV contrast function:

{B}_{\tau}=\underset{B}{\text{argmin}}\left\{{J}_{avMLP}(B;{\mathbf{x}}_{\tau},{\mathcal{V}}_{\tau})\right\}

(9)

This can be done via first- or second-order derivative-based optimization methods. For example, using the delta rule of gradient descent, we have

{B}_{\tau}(i)={B}_{\tau}(i-1)-\eta \frac{\partial {J}_{avMLP}(B;{\mathbf{x}}_{\tau},{\mathcal{V}}_{\tau})}{\partial B}

(10)

where *η* is the learning rate, which is either set to a fixed small number or adjusted using a line search. The gradient of *J*_{avMLP} with respect to *B* (omitting constant parameters for brevity) is calculated as:

\frac{\partial {J}_{avMLP}(B)}{\partial B}=\sum _{k=1}^{T}\frac{\partial {\mathcal{M}}_{MLP}({\mathcal{Y}}_{{e}_{\tau}}(k))}{\partial B}=\sum _{k=1}^{T}\frac{\partial {\mathcal{M}}_{MLP}({\mathcal{Y}}_{{e}_{\tau}}(k))}{\partial {\mathcal{Y}}_{{e}_{\tau}}(k)}\frac{\partial {\mathcal{Y}}_{{e}_{\tau}}(k)}{\partial B}

(11)

In the last summation, the first term is the gradient of the MLP AV model with respect to its input acoustic context {\mathcal{Y}}_{e}(k), and the second term is the gradient of the acoustic features with respect to the de-mixing model *B*. The gradient-based algorithm iteratively minimizes problem (9). Starting from an initial point *B*_{τ}(0), at each iteration *i* the gradient (11) is calculated and, using (10) or a quasi-Newton method, the improved de-mixing vector *B*_{τ}(*i*+1) is estimated. This continues until the change in the norm of *B*_{τ} or in *J*_{avMLP}(*B*_{τ}) becomes smaller than a pre-defined threshold.
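The update loop of (10) can be sketched on a toy instantaneous mixture. To keep the sketch self-contained, the analytic chain rule of (11) is replaced by a finite-difference gradient, and `j_av_mlp` below is a toy squared-error stand-in for the real MLP criterion; both are assumptions for illustration only:

```python
import numpy as np

def j_av_mlp(B, x, v):
    """Toy stand-in for the AV contrast of (8): squared error between the
    de-mixed batch y = B x and a target signal playing the role of the
    video-predicted stream."""
    y = B @ x
    return float(np.sum((v - y) ** 2))

def numeric_grad(f, B, eps=1e-6):
    """Central finite differences, replacing the chain rule of (11)."""
    g = np.zeros_like(B)
    for i in range(B.size):
        d = np.zeros_like(B); d[i] = eps
        g[i] = (f(B + d) - f(B - d)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 200))            # M = 2 mixtures, T = 200 samples
v = x[0] + 0.1 * rng.standard_normal(200)    # toy coherency target
B, eta = np.array([0.5, 0.5]), 1e-3
j0 = j_av_mlp(B, x, v)
for _ in range(200):                         # delta-rule iterations of (10)
    B = B - eta * numeric_grad(lambda b: j_av_mlp(b, x, v), B)
j1 = j_av_mlp(B, x, v)
print(j0, "->", j1)                          # contrast decreases over iterations
```

A fixed iteration count is used here in place of the norm-change stopping threshold described above.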

Since the AV contrast function is not convex, the optimization algorithm is prone to local minima. Thus, selecting a good initialization point *B*_{τ}(0) is important. A simple option is to start multiple times from random initial points. Most ICA algorithms (including FastICA [26] and JADE [27]) start from uncorrelated or whitened signals. Thus, another suggestion for the initial point *B*_{τ}(0) is to apply PCA to the mixed signals **x**_{τ} of the current batch *τ* and, among the eigenvectors, select the vector *W* that produces the signal *y*=*W* **x** most coherent with the visual stream {\mathcal{V}}_{\tau}, and use it as the initial point *B*_{τ}(0).

Thanks to the informed nature of AV contrast functions, neither the proposed nor the existing AVSS algorithms suffer from the permutation ambiguity. Nevertheless, the scale indeterminacy must be considered in the design of the AV contrast function and optimization method. The AV model must be invariant to a constant gain applied to the audio signal; that is, it must comply with the following constraint:

{J}_{avMLP}(\alpha B;\mathbf{x},\mathcal{V})={J}_{avMLP}(B;\mathbf{x},\mathcal{V})

(12)

### 3.3 AVSS using AV coherency and independence criterion

Although the existing and the proposed AV coherency-based methods provide improvements in speech source separation, these methods totally neglect the useful constraint of independence of the sources. The statistical independence criteria used by ICA methods have been successful in many BSS methods. In this section, we consider using AV coherency and statistical independence together to gain a further enhancement in speech source separation.

#### 3.3.1 Video-selected independent component

Due to the permutation indeterminacy (4), the signals separated by ICA methods cannot directly be used in real speech processing applications. Further, to calculate the output signal-to-interference ratio (SIR) performance of ICA methods, it is necessary to know which of the de-mixed signals corresponds to the source of interest.

AV incoherency scores from AV models may be incorporated to obtain a loosely coupled video-assisted ICA [14]. For that, in each batch of signals, the sources are estimated by an ICA method, and the source with minimum incoherency relative to the visual stream is selected as the speech of interest. JADE [27] is one of the most successful ICA methods because of its accurate separation and its uniform performance (equivariance property). In this paper, we use the JADE algorithm together with the MLP audio-visual model (for relevant source selection) as the video-assisted JADE (denoted JADE-AV).
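The video-assisted selection step can be sketched independently of the ICA routine that produced the sources. The incoherency function below is a toy squared-error stand-in for the MLP model, and the toy sources are assumptions:

```python
import numpy as np

# Sketch of the JADE-AV selection step: given the N sources estimated by any
# ICA routine, keep the one with minimum AV incoherency. A squared-error
# score replaces the MLP model here.
def select_av_source(sources, v, incoherency):
    """sources: (N, T) array of ICA outputs; returns the index of the
    source most coherent with the visual stream v."""
    scores = [incoherency(y, v) for y in sources]
    return int(np.argmin(scores))

rng = np.random.default_rng(3)
sources = rng.standard_normal((3, 100))               # toy ICA outputs
v = sources[2] + 0.05 * rng.standard_normal(100)      # toy visual target
idx = select_av_source(sources, v,
                       lambda y, v: float(np.sum((v - y) ** 2)))
print(idx)
```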

#### 3.3.2 Hybrid video coherent and independent component analysis

Contrary to the previous section, where a sequential and loose combination of ICA and the AV coherency model was considered, here we propose a parallel and tight combination using a hybrid criterion that benefits from the normalized kurtosis, as a statistical independence measure, in conjunction with the AV coherency measure.

Kurtosis and neg-entropy are used in ICA methods such as FastICA [26], which work by maximizing non-Gaussianity. The first kurtosis-based BSS method was presented in [30]; it separates sources via deflation. It starts by pre-whitening the observed signals. The first source is then estimated as *y*=*B* **x**^{′} from the white observations **x**^{′} using a normalized de-mixing vector *B*, which is found by maximizing the kurtosis of *y*, defined as kurt(*y*)=*E*{*y*^{4}}−3(*E*{*y*^{2}})^{2} (for zero-mean *y*), via a gradient-like method. The kurtosis is zero for Gaussian signals, while it is positive or negative for signals with super- or sub-Gaussian distributions, respectively. If both super- and sub-Gaussian sources are to be extracted, then the absolute or squared value of the kurtosis must be maximized.
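The behavior of the kurtosis measure on the three distribution classes mentioned above can be checked numerically; the sample sizes and distributions below are chosen only for illustration:

```python
import numpy as np

# The kurtosis from the text: kurt(y) = E{y^4} - 3 (E{y^2})^2 for zero-mean y.
# Gaussian samples score near zero, a super-Gaussian (Laplacian) signal
# scores positive, and a sub-Gaussian (uniform) one scores negative.
def kurt(y):
    y = y - y.mean()
    return float(np.mean(y ** 4) - 3 * np.mean(y ** 2) ** 2)

rng = np.random.default_rng(4)
k_gauss = kurt(rng.standard_normal(100_000))
k_lap = kurt(rng.laplace(size=100_000))
k_uni = kurt(rng.uniform(-1, 1, 100_000))
print(k_gauss)   # close to 0
print(k_lap)     # positive (super-Gaussian; ~12 for unit-scale Laplace)
print(k_uni)     # negative (sub-Gaussian)
```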

In [26], Hyvarinen et al. proposed a fast fixed-point algorithm for solving the constrained optimization of the kurtosis, and of a family of other neg-entropy-based criteria, under the normalization constraint on *B*, which resulted in the well-known FastICA algorithm.

The reason for pre-whitening and enforcing the normalization constraint on *B* is that the kurtosis is not scale invariant (i.e., kurt(*α* *y*)=*α*^{4}kurt(*y*)) and hence depends on both the energy and the non-Gaussianity of the signal. In [31] and [32], the normalized kurtosis, defined as

{\text{kurt}}_{n}(y)=\frac{\text{kurt}(y)}{{\left(E\{{y}^{2}\}\right)}^{2}}

(13)

is applied to the direct observations. The normalized kurtosis is scale invariant (i.e., kurt_{n}(*α* *y*)=kurt_{n}(*y*), ∀*α*≠0). Hence, it eliminates the need for pre-whitening and for the normalization constraint on the de-mixing vector *B*. To gain further improvement, we propose a hybrid criterion combining the AV criterion (8) with the normalized kurtosis:

{J}_{avICA}(B;{\mathbf{x}}_{\tau},{\mathcal{V}}_{\tau})={J}_{avMLP}(B;{\mathbf{x}}_{\tau},{\mathcal{V}}_{\tau})-\lambda {\text{kurt}}_{n}\left(B{\mathbf{x}}_{\tau}\right)

(14)

where *λ* is a positive regularization coefficient. Since the speech signal is known to have a super-Gaussian distribution [33, 34], the kurtosis term is added with a negative sign so that it tends to be maximized during the minimization of (14).
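The hybrid contrast (14) can be sketched by combining the two terms. The AV term below is a toy squared-error stand-in for the MLP criterion, and the toy Laplacian mixtures are assumptions for illustration:

```python
import numpy as np

# Sketch of the hybrid contrast (14): AV incoherency minus lambda times the
# normalized kurtosis, so minimizing it also drives the output toward a
# super-Gaussian (speech-like) distribution.
def kurt_n(y):
    yc = y - y.mean()
    return (np.mean(yc ** 4) - 3 * np.mean(yc ** 2) ** 2) / np.mean(yc ** 2) ** 2

def j_av_ica(B, x, v, j_av_mlp, lam=0.1):
    y = B @ x
    return j_av_mlp(y, v) - lam * kurt_n(y)

rng = np.random.default_rng(6)
x = rng.laplace(size=(2, 1000))      # toy super-Gaussian mixtures
v = x[0]                             # toy coherency target
score = j_av_ica(np.array([1.0, 0.0]), x, v,
                 lambda y, v: float(np.mean((v - y) ** 2)))
print(score)   # negative here: zero AV error, positive kurtosis reward
```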

It must be noted that, over short time durations, the kurtosis score is not robust and does not provide a significant improvement. Thus, (14) is intended for the convolutive case, where quite large batches are considered. In fact, our tests revealed that for the small batch sizes used with instantaneous mixtures, the performance of the AV method with the kurtosis penalty does not improve over the pure AV method.

### 3.4 Toward a time domain AVSS for convolutive mixtures

Here, we consider convolutive mixtures defined by a MIMO system of *M* × *N* FIR filters **A**=[*A*_{ij}]. The mixing system can be represented in the Z-domain as

\mathbf{X}(z)=\mathbf{A}(z)\mathbf{S}(z)

(15)

We are interested in estimating a 1 × *M* row vector *B*(*z*) of de-mixing FIR filters that extracts the source *S*^{1}(*z*)=*B*(*z*)**X**(*z*) which is as coherent as possible with the video stream **V**^{1}. In [31], a time-domain algorithm based on maximizing the (normalized) kurtosis is presented, which deflates sources one by one using non-causal two-sided FIR filters. We take it as our baseline audio-only convolutive method in our experiments. Following [35], we define an embedded matrix notation that transforms the convolutive mixture (15) into an equivalent instantaneous mixture. Let **x**^{′}(*n*) be an embedded column vector defined at each time step *n* as:

{\mathbf{x}}^{\prime}(n)={\left[{x}^{1}(n+L),\cdots ,{x}^{1}(n-L),\cdots ,{x}^{M}(n+L),\cdots ,{x}^{M}(n-L)\right]}^{T}

(16)

It contains *M*(2*L*+1) observation samples, and with it the convolutive de-mixing process for separating signal *s*^{1} can be expressed as *y*(*n*)=*B* **x**^{′}(*n*), where *B* is a row vector containing the coefficients of the *M* de-mixing FIR filters, each having 2*L*+1 taps. This is just an instantaneous mixture with *M*(2*L*+1) virtual (embedded) observations and can be solved using the kurtosis-based method of [31] or our proposed criterion (14).
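The embedding (16) can be sketched directly. The signal sizes below are arbitrary, and `n` is assumed to satisfy *L* ≤ *n* < *n*_samples − *L* so that no index falls outside the batch:

```python
import numpy as np

# Sketch of the embedding (16): stacking 2L+1 delayed copies of each of the
# M observations turns the convolutive de-mixing into the instantaneous
# form y(n) = B x'(n).
def embed(x, n, L):
    """x: (M, n_samples) observations; returns the M(2L+1)-dim embedded
    vector x'(n) = [x^1(n+L),...,x^1(n-L),...,x^M(n+L),...,x^M(n-L)]^T.
    Requires L <= n < n_samples - L."""
    return np.concatenate([x[m, n + L : n - L - 1 : -1]
                           for m in range(x.shape[0])])

rng = np.random.default_rng(7)
M, L = 2, 3
x = rng.standard_normal((M, 50))
xp = embed(x, 10, L)
print(xp.shape)   # (M * (2L + 1),) = (14,)
```

A de-mixing row vector `B` of the same length then yields `y = B @ xp`, exactly the instantaneous form described above.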

As a final note, it should be mentioned that the reference method of [31] can estimate the de-mixing filters only up to a scale and a time delay. Thus, a cross-correlation step is necessary to fix the possible delay of the filters. For further details, please refer to [31]. When dealing with convolutive mixtures, the objective scores must be calculated on longer segments of the signals, since there are a larger number of parameters to estimate.