Audio-visual speech source separation via improved context-dependent association model
 Alireza Kazemi^{1},
 Reza Boostani^{1} and
 Fariborz Sobhanmanesh^{1}
https://doi.org/10.1186/16876180201447
© Kazemi et al.; licensee Springer. 2014
Received: 7 July 2013
Accepted: 18 February 2014
Published: 5 April 2014
Abstract
In this paper, we exploit the nonlinear relation between a speech source and its associated lip video as a source of extra information to propose an improved audiovisual speech source separation (AVSS) algorithm. The audiovisual association is modeled using a neural associator which estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on mean square error (MSE) measure between estimated and target visual parameters.
This function is minimized to estimate the demixing vector/filters that separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We also propose a hybrid criterion which uses AV coherency together with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and output signal-to-interference ratio (SIR) of source separation. The suggested audiovisual model significantly improves relevant speech classification accuracy compared to the existing GMM-based model, and the proposed AVSS algorithm improves the speech separation quality compared to reference ICA- and AVSS-based methods.
1 Introduction
Audiovisual speech source separation (AVSS) is a growing field of research that has developed in recent years. It is derived from combining audiovisual speech processing (AVSP) and blind source separation (BSS) techniques.
Speech is originally a bimodal audiovisual process. Perceptual studies on human audition have revealed that the visual modality contributes effectively to speech intelligibility [1], perception [2] and detection [3], especially in noisy and multi-source (cocktail party) situations. According to the McGurk-MacDonald effect [4] (that is, hearing the auditory part of one phonetic sound while seeing the visual part of another results in the illusory perception of a third), it is evident that there is an early-stage interaction between audio and visual stimuli in the brain. It is confirmed in [5] that early integration of audio and visual modalities can help in the identification, and hence enhancement, of speech in noisy environments. The performance of automatic speech processing systems degrades drastically in the presence of noise or other acoustic sources. Thus, researchers have tried to incorporate the visual modality into automatic speech processing systems based on these perceptual findings.
Both audio and visual modalities of speech originate from gestures and dynamics of articulators along the speaker’s vocal tract. Hence, there is an intrinsic relation between these two speech cues. However, among all articulators, only the lips and, partially, the jaw are visually observable. This partial observation yields a stochastic but exploitable relation between audio and visual cues.
It is instructive to consider the AV relation as two components, one coherent and one complementary. In the automatic speech processing community, the complementary (orthogonal) portion of AV information attracted attention and interest earlier (since 1984 [6]) than its coherent (non-orthogonal) portion. The complementary information of AV data has been adopted in audiovisual speech recognition (AVSR) in early (feature), middle (model), or late (decoding) stage fusion schemes to enhance robustness against acoustic distortions. In recent years (since 2001 [7]), researchers have proposed methods that exploit the coherent component of AV processes for tasks such as speech enhancement [7–9], acoustic feature enhancement [10], visual voice activity detection (VVAD) [11], and AV source separation (AVSS) [11–24].
In [12], a statistical AV model based on Gaussian mixture models (GMMs) is presented for measuring the coherency of audio and its corresponding video and is used for extracting the speech of interest from instantaneous square mixtures on a simple corpus of French logatoms. The authors extended their method in [14] and assessed it on a more general sentence corpus and also on degenerate mixtures. Wang et al. [15] exploited a similar GMM model (but with different AV features) as a penalty term for solving convolutive mixtures. That method seems to be inefficient because it must repeatedly convert the separating system from the frequency domain to the time domain. Rajaram et al. [13] incorporated visual information in a Bayesian AVSS for separation of two-channel noisy mixtures. Their method adopts a Kalman filter with an additional independence constraint between the states (sources). Rivet et al. [16] adopted the AV coherency of speech (measured by a trained log-Rayleigh distribution) for resolving the permutation indeterminacy in the frequency-domain separation of convolutive mixtures. They also proposed another method [11] for convolutive AVSS, based on developing a VVAD and using it in a geometric separation algorithm under a sparse source assumption.
Sigg et al., in a pioneering work [17], proposed a single-microphone AVSS method by developing a nonnegative sparse canonical correlation analysis (NSCCA) algorithm. Their method jointly separates audio signals and localizes their corresponding visual sources. Following them, Casanovas and Monaci et al. [18–21] proposed single-microphone AV separation and localization methods based on sparse and redundant atomic representation of AV signals. They use cross-modal correlations between AV atoms as a similarity measure to cluster visual atoms for localizing visual sources and then separating audio signals.
Liang et al. [22] incorporated visual localization to improve fast independent vector analysis (FastIVA), a frequency-domain convolutive method. They use the location of sources for smart initialization of FastIVA to solve its block permutation. Liu et al. [23] proposed an AV dictionary learning method (AVDL) and used it for AV-BSS via bimodal sparse coding to estimate time-frequency (TF) masks.
Khan et al. [24] proposed a video-aided separation method for two-channel reverberant recordings. It estimates the direction of sources via visual localization; these directions are used in probabilistic models which are refined using the EM algorithm and evaluated at discrete TF points to generate separating masks.
In this paper, we develop a visually informed speech source separation algorithm called MLP-AVSS which considers the temporal dependency between consecutive AV frames. We suggest modeling AV coherency using a multilayer perceptron (MLP) for AV association. This model, with a lower number of parameters, captures AV coherency significantly better than the GMM AV model of [12, 14, 15]. We also propose a hybrid measure of kurtosis and visual coherency and, based on that, a time-domain convolutive AVSS algorithm. We have assessed the quality of the suggested AV model and its induced AVSS methods on two audiovisual corpora, one discrete (alpha-digits) and one continuous (poet-verses). The former is a corpus of Persian and English alpha-digits and the latter is a corpus of poem verses from about 20 Persian poets.
The rest of this paper is organized as follows: In Section 2, we briefly review BSS and AVSP background and then focus on the relevant AVSS work. Section 3 presents the proposed MLP-based AV model and AVSS algorithm. Section 3.3 presents a hybrid AV coherency and independence criterion, and based on that, we move toward a time-domain convolutive extension. In Section 4, audiovisual materials including the AV corpus, parametrization and modeling procedures are considered. In Section 5, the experimental setup and the experimental results are presented and analyzed. Finally, the paper is concluded in Section 6.
2 Background review
AVSS has emerged from combining BSS and audiovisual speech processing techniques [16]. In this section, after a brief review of BSS and AV speech processing background, we explain speech separation in terms of the standard source separation problem and then discuss the suggested AV separation approach as an improved solution for this problem.
2.1 Blind source separation problem
In the linear instantaneous case, the observed mixtures are modeled as $\mathbf{x}(t)=\mathbf{A}(t)\,\mathbf{s}(t)$ (1), where $\mathbf{s}(t)=[s_{1}(t),\dots,s_{N}(t)]^{T}\in\mathbb{R}^{N}$ is the vector of source samples, $\mathbf{x}(t)=[x_{1}(t),\dots,x_{M}(t)]^{T}\in\mathbb{R}^{M}$ is the vector of mixed signals, and $\mathbf{A}(t)\in\mathbb{R}^{M\times N}$ is the mixing matrix at time instant t. Note that both s(t) and A are unknown. Hence, the problem is termed BSS, that is, estimation of the unmixed signals y(t)=B(t)x(t) from the mixed signals x using an unknown demixing matrix B such that they are as similar as possible to the unknown sources s.
where τ is the batch index iterating over all batches of the signals. In this case, the mixing and demixing models (A_{ τ }, B_{ τ }) are time invariant during each batch. Another solution is to consider adaptive source separation techniques, which are beyond the scope of this paper.
2.2 Audiovisual speech processing
Before explaining AV source separation methods, it is necessary to review some issues in AV speech processing which also inherently arise in AV source separation:

The speech signal and lip video are nonstationary in time.

The rates of speech samples and video frames are significantly different. In this study, the speech signal is sampled at F_{s}=16,000 Hz while the video frame rate is F_{r}=30 fps.

Speech signal and video frames have large numbers of samples (pixels) containing sparse information. This prevents creating audiovisual models directly from these signals.
To cope with the first two issues, in most speech processing problems, speech is processed frame-wise with frames of 20−30 ms length, over which the speech signal can be considered stationary. In AV speech processing, it is convenient to choose the speech frame length such that audio and video frame rates are equal.
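As a rough illustration of matching the audio frame rate to the video frame rate, the following Python sketch splits a signal sampled at 16,000 Hz into one audio frame per video frame at 30 fps. The helper name `frame_audio` and the use of non-overlapping frames with rounded boundaries are assumptions made for illustration; the framing actually used in this work may differ (for example, overlapping windows).

```python
import numpy as np

FS = 16_000   # audio sampling rate in Hz (from the text)
FR = 30       # video frame rate in fps (from the text)

def frame_audio(signal, fs=FS, fr=FR):
    """Split a 1-D signal into non-overlapping frames, one per video frame.

    Non-overlapping frames are an assumption of this sketch.
    """
    n_frames = int(len(signal) * fr // fs)
    # fs / fr (about 533.33 samples) is not an integer, so boundaries are rounded.
    edges = np.round(np.arange(n_frames + 1) * fs / fr).astype(int)
    return [signal[edges[k]:edges[k + 1]] for k in range(n_frames)]

one_second = np.zeros(FS)
frames = frame_audio(one_second)
print(len(frames), "frames of about", round(1000.0 / FR, 1), "ms each")
```

With these rates each frame spans roughly 33 ms (533 or 534 samples), slightly longer than the 20−30 ms range quoted for audio-only processing.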
To handle the third issue, the routine solution is to extract compact and informative acoustic and visual features from speech and video frames such that each frame is represented with a small number of parameters. Frame-wise processing of speech is practical in most speech processing tasks, but the amount of speech signal in a single frame may be insufficient for source separation algorithms to perform accurately. Hence, several consecutive frames must be used in each batch τ.
2.3 Audiovisual source separation
that is, the signals’ amplitude gains and their order cannot be determined using these algorithms. The permutation of estimated sources may also change across consecutive frames, because sources are nonstationary in time and space. Having a true or at least stable ordering of sources is crucial in most automatic speech processing applications. Furthermore, ICA-based methods either do not handle noisy and degenerate mixtures (i.e., mixtures with M<N) or perform poorly on them.
Incorporating the visual modality of speech as a source of extra information can help to solve these problems. The permutation problem can be simply resolved, and the separation performance is enhanced in regular and degenerate mixtures.
Most AVSS algorithms work by maximizing the AV coherency between the unmixed signals y and their corresponding video streams. It is shown in [12] that, given the coarse spectral envelope of the sources, one can solve a system of equations to calculate the demixing matrix in regular mixtures. Moreover, there exists a stochastic coherent relation between the speech spectral envelope and the lip visual features [9, 12]. These two facts have guided researchers toward capturing the AV relation using different models and adopting it for AVSS tasks.
In [12] and [14], the authors proposed a joint statistical distribution $p_{\text{av}}(\mathcal{S},\mathcal{V})$ as an AV model which measures the coherency between the acoustic spectral ($\mathcal{S}$) and lip visual ($\mathcal{V}$) features of the speech in each frame. The distribution $p_{\text{av}}(\mathcal{S},\mathcal{V})$ is modeled by a GMM and is trained on a corpus of corresponding AV streams via the Expectation Maximization (EM) algorithm.
The summation in (6) is based on the assumption that consecutive AV frames are independent of each other.
In the rest of this text, unless mentioned otherwise, we always consider a single-row demixing vector denoted by B corresponding to a single visual stream. For the sake of brevity, we omit the superscript (.)^{1} of variables. Clearly, when multiple video streams exist, corresponding to more than one speech source, all the described methods can be repeated for each video stream.
3 Audiovisual speech source separation using MLP AV modeling
Here, a method is proposed for separation of the source of interest s from M mixed signals x. The goal is to estimate B such that y=B x is as similar as possible to the original source s. Although s is unknown, we have the visual stream $\mathcal{V}$ corresponding to it, so we can estimate B such that $\widehat{\mathcal{V}}$ (the estimated visual stream corresponding to y) is as close as possible to $\mathcal{V}$.
A problem with the objective function (6) of [12] and [14] is that it does not efficiently model the nonlinear AV relation (as discussed later in this section). It also relies on an independence (i.i.d.) assumption in modeling the relation of consecutive AV frames. We suggest improving the AV criterion via more realistic assumptions.
Consider the batch-wise separation problem of equation (2) where every batch τ consists of T frames. Ideally, one would model and measure the degree of AV coherency on the joint whole sequences of audio $\mathcal{S}_{\tau}(1:T)$ and visual $\mathcal{V}_{\tau}(1:T)$ frames, considering the true dependency among the variables. Let $\mathcal{M}_{IDL}(\mathcal{S}_{\tau},\mathcal{V}_{\tau})$ be such an ideal model which measures the degree of incoherency between AV streams. Then, the demixing vector B_{ τ } may be estimated by minimizing the ideal AV criterion $J_{avIDL}(B;\mathbf{x}_{\tau},\mathcal{V}_{\tau})=\mathcal{M}_{IDL}(\mathcal{Y}_{\tau},\mathcal{V}_{\tau})$.
However, training such an ideal model is not practical, both because it needs a large amount of AV training data and because of its training and optimization complexity. Hence, it is inevitable to adopt relaxing assumptions which factorize the model into a combination of reusable factor(s). The independent and identically distributed (i.i.d.) assumption made in the GMM model of (6) is not a suitable assumption for modeling speech AV streams. Thus, we propose an enhanced model with a weaker independence assumption. Instead of assuming absolute independence between AV frames, we adopt a conditional independence assumption, that is, the coherency of an AV frame can be estimated independently of other frames given a context of a few (K) neighboring frames.
An extension of $p_{av}(\mathcal{S},\mathcal{V})$ to model the joint probability density function (PDF) of K consecutive AV frames is not efficient. GMMs and Gaussian distributions with full covariance matrices are not suitable for modeling high-dimensional random vectors, since the number of free parameters of these models is of order O(d^{2}) relative to the dimension d of the input random vectors. Increasing the input dimension by concatenating K AV frames results in a very complex model with a huge number of free parameters that are not used effectively.
We propose to use an MLP instead of a GMM, and a mean square error (MSE) criterion instead of negative log probability (as the incoherency measure), to provide an enhanced AV criterion. The number of free parameters of an MLP with narrow hidden layer(s) is of order O(d_{ i }+d_{ o }) relative to the dimensions d_{ i } and d_{ o } of its input and output. Moreover, an MLP makes efficient use of its free parameters in learning the nonlinear AV relation, owing to its hierarchical structure compared to the shallow and wide structure of a GMM. An MLP, like a GMM, is differentiable with respect to its input. Hence, an objective function defined on an MLP can be optimized with fast convergence using derivative-based algorithms.
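To make the difference in model complexity concrete, the following Python sketch counts free parameters for a full-covariance GMM over the joint AV vector and for an MLP with a single hidden layer. The function names and the illustrative sizes (k_a=8, k_v=4, 16 components or hidden units) are assumptions chosen for this example; the counting formulas themselves are the standard ones for these model families.

```python
def gmm_free_params(d, n_components):
    """Free parameters of a full-covariance GMM over d-dimensional vectors."""
    per_component = d + d * (d + 1) // 2          # mean + symmetric covariance
    return n_components * per_component + (n_components - 1)  # + mixture weights

def mlp_free_params(d_in, d_out, n_hidden):
    """Weights and biases of an MLP with one hidden layer."""
    return (d_in + 1) * n_hidden + (n_hidden + 1) * d_out

k_a, k_v, n_units = 8, 4, 16      # illustrative sizes, not the paper's selection
for K in (1, 2, 4, 6):
    d_joint = K * k_a + k_v       # the GMM models the joint AV vector
    print(K,
          gmm_free_params(d_joint, n_units),
          mlp_free_params(K * k_a, k_v, n_units))
```

The GMM count grows quadratically with the context size K through the covariance term, while the MLP count grows only linearly.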
3.1 MLP audio visual model
3.2 Audio visual source separation algorithm
Besides the difference between negative log probability and mean square error, another difference between AV objective functions (6) and (8) is the form of the independence assumption used in measuring the incoherency. The former assumes absolute independence (i.e., i.i.d.) between the frames, while the latter assumes conditional independence.
In the last summation, the first term is the gradient of the MLP AV model with respect to its input acoustic context ${\mathcal{Y}}_{e}\left(k\right)$ and the second term is the gradient of the acoustic features with respect to the demixing model B. A gradient-based algorithm iteratively minimizes the problem (9). Starting from an initial point B_{ τ }(0), at each iteration i, the gradient (11) is calculated, and using (10) or a quasi-Newton method, the improved demixing vector B_{ τ }(i+1) is estimated. This continues until the change in the norm of B_{ τ } or in J_{avMLP}(B_{ τ }) becomes smaller than a predefined threshold.
Since the AV contrast function is not convex, the optimization algorithm is prone to local minima. Thus, selection of a good initialization point B_{ τ }(0) is important. A simple option is to start from random initial points multiple times. Most ICA algorithms (including FastICA [26] and JADE [27]) start from uncorrelated or white signals. Thus, another suggestion for the initial point B_{ τ }(0) is to apply PCA to the mixed signals x_{ τ } of the current batch τ and, among the eigenvectors, select the vector W that produces the signal y=W x which is most coherent with the visual stream ${\mathcal{V}}_{\tau}$, and use it as the initial point B_{ τ }(0).
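The PCA-based initialization can be sketched in Python as follows. The function `pca_initial_vector` and the `incoherency` callable are hypothetical names introduced for illustration; in practice the score would come from the trained AV model evaluated against the visual stream, whereas the toy score below simply uses correlation with a known source so that the example runs on its own.

```python
import numpy as np

def pca_initial_vector(x, incoherency):
    """Pick the PCA direction whose output is most coherent with the video.

    x: (M, n_samples) mixtures of one batch.
    incoherency: hypothetical callable mapping a 1-D signal to a score
                 (lower means more coherent with the visual stream).
    """
    xc = x - x.mean(axis=1, keepdims=True)
    cov = xc @ xc.T / xc.shape[1]
    _, eigvecs = np.linalg.eigh(cov)      # columns are eigenvectors
    candidates = eigvecs.T                # one candidate row vector W each
    scores = [incoherency(w @ xc) for w in candidates]
    return candidates[int(np.argmin(scores))]

# Toy data: two sources of different power mixed by a rotation.
rng = np.random.default_rng(0)
s = np.vstack([3.0 * rng.laplace(size=5000), rng.laplace(size=5000)])
theta = 0.6
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
x = A @ s
# Toy stand-in for the AV model: negative absolute correlation with source 1.
score = lambda y: -abs(np.corrcoef(y, s[0])[0, 1])
B0 = pca_initial_vector(x, score)
```

The selected vector only serves as a starting point; the gradient-based minimization then refines it.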
3.3 AVSS using AV coherency and independence criterion
Although the existing and the proposed AV coherency-based methods improve speech source separation, they totally neglect the useful constraint of independence of the sources. The statistical independence criterion used by ICA methods has been successful in many BSS methods. In this section, we consider the benefit of using AV coherency and statistical independence together to gain further enhancement in speech source separation.
3.3.1 Videoselected independent component
Due to the permutation indeterminacy (4), separated signals from ICA methods cannot be used directly in real speech processing applications. Further, to calculate the output signal-to-interference ratio (SIR) performance of ICA methods, it is necessary to know which of the demixed signals is related to the source of interest.
AV incoherency scores from AV models may be incorporated to introduce a loosely coupled video-assisted ICA [14]. For that, in each batch of signals, the sources are estimated by an ICA method, and the source with minimum incoherency relative to the visual stream is selected as the speech of interest. JADE [27] is one of the most successful ICA methods because of its accurate separation and its uniform performance (equivariance property). In this paper, we use the JADE algorithm together with the MLP audiovisual model (for relevant source selection) as the video-assisted JADE (denoted by JADE-AV).
3.3.2 Hybrid video coherent and independent component analysis
Contrary to the previous section where a sequential and loose combination of ICA and AV coherency model was considered, here we propose a parallel and tight combination using a hybrid criterion which benefits from normalized kurtosis as a statistical independence measure in conjunction with the AV coherency measure.
Kurtosis and negentropy are used in ICA methods such as FastICA [26], which work by maximizing non-Gaussianity. The first kurtosis-based BSS method was presented in [30] to separate sources via deflation. It starts by prewhitening the observed signals. Then the first source is estimated as y=B x^{′} from the white observations x^{′} using a normalized demixing vector B. This vector is estimated by maximizing the kurtosis of y, defined as kurt(y)=E{y^{4}}−3(E{y^{2}})^{2} (for zero-mean y), via a gradient-like method. The kurtosis is zero for Gaussian signals, positive for signals with super-Gaussian distributions, and negative for signals with sub-Gaussian distributions. If both super- and sub-Gaussian sources are expected to be extracted, then the absolute or squared value of the kurtosis must be maximized.
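The sign behavior of the kurtosis can be checked numerically with a short Python sketch. The sample-based estimator below replaces the expectations by sample means; the test distributions (Gaussian, Laplacian, uniform) and the sample size are choices made for this illustration only.

```python
import numpy as np

def kurt(y):
    """Sample estimate of kurt(y) = E{y^4} - 3 (E{y^2})^2 (mean removed first)."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
n = 200_000
gaussian = rng.standard_normal(n)            # kurtosis close to zero
laplacian = rng.laplace(size=n)              # super-Gaussian: positive kurtosis
uniform = rng.uniform(-1.0, 1.0, size=n)     # sub-Gaussian: negative kurtosis

for name, y in [("gaussian", gaussian), ("laplacian", laplacian), ("uniform", uniform)]:
    print(name, kurt(y))
```

The estimate depends on the fourth moment, so it fluctuates noticeably on short segments, which matches the remark later in this section that the kurtosis score is not robust over short durations.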
In [26], Hyvärinen et al. proposed a fast fixed-point algorithm for solving the constrained optimization of the kurtosis and a family of other negentropy-based criteria under the normalization constraint on B, which resulted in the well-known FastICA algorithm.
where λ is a positive regularization coefficient. Since the speech signal is known to have a super-Gaussian distribution [33, 34], the kurtosis term is added with a negative sign so that it tends to be maximized during minimization of (14).
It must be noted that, over short time durations, the kurtosis score is not robust and does not provide significant improvement. Thus, (14) is developed for the convolutive case, where quite large batches are considered. In fact, our tests revealed that for the small batch sizes used in instantaneous mixtures, the performance of the AV method with the kurtosis penalty does not improve compared to the pure AV method.
3.4 Toward a time domain AVSS for convolutive mixtures
It contains M(2L+1) observation samples, and with it the convolutive demixing process for separation of signal s^{1} can be expressed as y(n)=B x^{′}(n), where B is a row vector containing the coefficients of M demixing FIR filters, each having 2L+1 taps. This is just an instantaneous mixture with M(2L+1) virtual (embedded) observations and can be solved using the kurtosis-based method of [31] or using our proposed criterion (14).
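The equivalence between instantaneous demixing of the embedded observations and FIR filtering of the original mixtures can be verified with a small Python sketch. The ordering of samples inside the embedded vector (from x_m(n+L) down to x_m(n−L) for each channel m) and the helper name `embed` are assumptions of this example, as is the random test data.

```python
import numpy as np

def embed(x, L):
    """Build embedded observations from x of shape (M, N).

    Returns an (M*(2L+1), N-2L) matrix whose column j stacks, for every
    channel m, the samples x_m(n+L), ..., x_m(n-L) with n = j + L
    (assumed ordering).
    """
    M, N = x.shape
    rows = [x[m, L - d: N - L - d] for m in range(M) for d in range(-L, L + 1)]
    return np.vstack(rows)

rng = np.random.default_rng(0)
M, N, L = 2, 1000, 3
x = rng.standard_normal((M, N))
B = rng.standard_normal(M * (2 * L + 1))   # stacked taps of the M demixing filters

x_emb = embed(x, L)
y = B @ x_emb                              # instantaneous demixing of embedded data

# Same output obtained by filtering each channel with its own FIR filter.
taps = B.reshape(M, 2 * L + 1)
y_fir = sum(np.convolve(x[m], taps[m], mode="valid") for m in range(M))
print(np.allclose(y, y_fir))
```

Because the two computations agree, any instantaneous criterion applied to the embedded data effectively estimates the demixing filters.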
As a final note, it should be mentioned that the reference method of [31] can estimate the demixing filters only up to a scale and a time delay. Thus, a cross-correlation step is necessary to fix the possible delay of the filters. For further details, please refer to [31]. When dealing with convolutive mixtures, it is necessary to calculate the objective scores on longer segments of signals, since there is a larger number of parameters to estimate.
4 Audiovisual data and models
The audiovisual corpus and models are the building blocks for realization and evaluation of the proposed AVSS algorithm, which is a data-driven method. In the following, we describe the creation of the AV corpus and the training of the models.
4.1 Audiovisual data
To evaluate the proposed algorithm, we have recorded a suitable AV corpus which is comparable in size and complexity to the corpora used in former research. Unlike [11, 12], we have not used blue lip makeup in the recordings since we do not need lip segmentation for extraction of geometric features such as width and height. Instead, the gray values of the pixels in the speaker’s mouth region are used to extract the visual parameters. We have recorded two different types of corpora. The first corpus consists of discrete Persian and English alphabet letters and digits with a vocabulary size of 78 words (32 + 10 Persian and 26 + 10 English alpha-digits). The second corpus is continuous and consists of 140 verses by Persian poets. Both corpora are uttered by a male speaker. Each corpus is recorded twice. The first recording is used for training AV models and the second recording is used in the evaluation phase.
4.2 Audio and video parameter extraction
As discussed before, speech signals and lip image frames are high-dimensional data with sparse information related to our task. Thus, parametrizing audio and visual frames into compact vectors is necessary. Here, we describe the methods for audio and visual feature extraction.
4.2.1 Audio parametrization
The Jacobian of F with respect to B is derived in the Appendix in Equations 22, 23 and 24. The derived formulas are efficient and do not need FFT recalculation for different values of B. It is also worth mentioning that the mapping F is invariant to scalar multiplication (i.e., F(α B;X(k))=F(B;X(k)), ∀α≠0), a property that entails the gain invariance property (12) in AV contrast functions (6) and (8).
4.2.2 Video parametrization
In previous works, such as [11, 12, 14], authors have used geometric lip parameters that need lip contour detection to estimate the width and height of the interior lip contour. We extract holistic visual features from all pixels of the mouth region. This requires less computation and does not require contour fitting. Let the function g(.) be the visual feature mapping function which extracts k_{ v } visual features from any video frame. We assume that the mouth region can be extracted from video using detection and tracking algorithms. There exist efficient parametric head tracking algorithms such as [36] which can be adopted for this task. The corpus used in this paper simply provides the lip region in each frame. To extract k_{ v } visual features, the mouth region of each frame is shrunk to 32 × 24 pixels and then reshaped to a 768 × 1 image vector. Finally, a PCA transform is applied to extract the visual features. The PCA matrix ${\mathbf{W}}_{v}\in {\mathbb{R}}^{768\times {k}_{v}}$ is computed from the training data and is used to project training and test mouth region images to k_{ v }-element visual parameter vectors. Figure 1b shows the top eigenvectors (eigenlips) in order.
The overall visual feature extraction function is defined as the normalized projection of the gray values of the mouth region: $\mathcal{V}(k)=g(\mathbf{V}(k))=\mathbf{Q}_{v}^{T}\mathbf{W}_{v}^{T}\mathbf{V}(k)$, where Q_{ v } is the diagonal scaling matrix calculated from the square roots of the corresponding eigenvalues.
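A minimal Python sketch of this PCA-based visual parametrization is given below. Several details are assumptions made for illustration and are not stated in the text: the frames are mean-centered before projection, Q_v is taken as the diagonal matrix of inverse square roots of the eigenvalues (so that training features have unit variance), and random arrays stand in for real 32 × 24 mouth images. The function names are hypothetical.

```python
import numpy as np

def fit_visual_pca(frames, k_v):
    """frames: (n_frames, 768) flattened 32 x 24 gray mouth images (training data)."""
    mean = frames.mean(axis=0)
    cov = np.cov(frames - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k_v]           # k_v largest eigenvalues
    W_v = eigvecs[:, order]                            # (768, k_v) eigenlips
    Q_v = np.diag(1.0 / np.sqrt(eigvals[order]))       # assumed normalization
    return mean, W_v, Q_v

def visual_features(frame, mean, W_v, Q_v):
    """g(V(k)) = Q_v^T W_v^T V(k), applied here to the mean-removed frame."""
    return Q_v.T @ W_v.T @ (frame - mean)

rng = np.random.default_rng(0)
train = rng.random((200, 32 * 24))     # synthetic stand-in for mouth images
mean, W_v, Q_v = fit_visual_pca(train, k_v=4)
features = np.array([visual_features(f, mean, W_v, Q_v) for f in train])
print(features.shape)
```

Under the assumed normalization, each visual parameter has unit variance on the training set, which keeps the MSE criterion from being dominated by the first eigenlip.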
To assess and understand the quality of the visual features, a simple yet insightful simulation is illustrated in Figure 1c. In PCA, the eigenvector with the largest eigenvalue captures most of the variance of the dataset. Most of the variance of lip images during speaking is along the opening and closing of the lips. Thus, it is expected that the principal eigenvector ${\mathbf{W}}_{v}^{1}$ will model this direction of variation. To check this, we calculated the mean vector µ_{ v } of all video frames in the poet-verses training corpus and varied it along the principal eigenvector ${\mathbf{W}}_{v}^{1}$ by negative and positive integer multiples of the square root σ_{1} of the corresponding eigenvalue. The results presented in Figure 1c show that this produces synthesized lip images that open and close.
4.3 Building audiovisual models
In addition to estimation of the transforms W_{ a } and W_{ v }, which are part of the AV feature mapping functions f(.) and g(.), the training set of each corpus is used to learn the AV models. The training set consists of synchronous sequences of AV pairs $(\mathcal{S}(k),\mathcal{V}(k))$ extracted from the raw AV data. For training models with K>1, first, the embedded acoustic stream ${\mathcal{S}}_{e}$ is formed by K-fold embedding of the frames of the acoustic stream $\mathcal{S}$. Instead of stacking the K frames of context, it is better to stack the center frame together with the temporal differences of the other frames. This reduces the redundancy in the embedded vector. Then, the embedded pairs $({\mathcal{S}}_{e}(k),\mathcal{V}(k))$ are used to train the models. Both GMM and MLP models are trained with different context sizes for a fair comparison. However, as the experimental results of Section 5.1 show, the GMM degrades with K>1.
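One possible reading of this embedding is sketched below in Python. The text does not specify how the K context frames are placed around the current frame, how boundaries are handled, or whether differences are taken relative to the center frame or between successive frames; the sketch assumes a context from k−⌊(K−1)/2⌋ to k+⌊K/2⌋, edge replication at the boundaries, and differences relative to the center frame. The name `embed_context` is hypothetical.

```python
import numpy as np

def embed_context(S, K):
    """S: (n_frames, k_a) acoustic features. Returns (n_frames, K * k_a).

    Each output row holds the center frame followed by the differences
    between every other context frame and the center frame (assumed form).
    """
    n = S.shape[0]
    before, after = (K - 1) // 2, K // 2          # assumed context placement
    padded = np.pad(S, ((before, after), (0, 0)), mode="edge")
    center = padded[before: before + n]           # identical to S
    parts = [center]
    for offset in range(-before, after + 1):
        if offset != 0:
            shifted = padded[before + offset: before + offset + n]
            parts.append(shifted - center)
    return np.hstack(parts)

S = np.arange(12.0).reshape(6, 2)     # toy stream: 6 frames, k_a = 2
E = embed_context(S, K=3)
print(E.shape)
```

Whatever the exact convention, the embedded vector keeps dimension K·k_a while replacing K−1 highly correlated frames with their usually smaller differences.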
For training GMM models, the AV components of each pair are concatenated and considered as samples of the joint PDF p_{av}(.,.). These samples are used to estimate the GMM parameters by maximum likelihood via the expectation maximization (EM) algorithm [37]. GMM distributions with various configurations of parameters (k_{ a }, k_{ v }, N_{ M }, K) are trained. To ensure good training, for each setting, the GMM distribution is trained 20 times using EM with random initialization and the best model is selected based on a validation subset of the training data. Regularization by adding a small positive number in the range [10^{−10},10^{−2}] to the diagonal elements of the covariance matrices was adopted to preserve positive definiteness where necessary (especially for models with larger random vector dimensions).
MLP AV models are also trained on AV pairs $({\mathcal{S}}_{e}(k),\mathcal{V}(k))$ with ${\mathcal{S}}_{e}(k)$ as input and $\mathcal{V}(k)$ as output. The MSE criterion between the true and estimated outputs $\mathcal{V}(k)$ and $\widehat{\mathcal{V}}(k)$ is used as the performance measure in training. This is the same criterion as the one used in the contrast function (8). Networks were trained using the Levenberg-Marquardt algorithm [38] with early stopping based on a validation subset to avoid overfitting. As for the GMM, MLP models with various configurations of parameters (k_{ a }, k_{ v }, N_{ H }, K) are trained. To avoid local minima in training, each model is trained 20 times with random initialization and the best model is selected based on the validation subset.
5 Experiments and results
For evaluation of the proposed method, we have conducted four sets of experiments at different stages. First, the fitness of the AV models in capturing AV coherency is evaluated with some initial experiments, which provide enough data for hyperparameter selection of the models. Then, multiple source separation experiments on regular (N×N) and degenerate (M×N, M<N) cases are conducted to compare the performance of the proposed MLP-based AVSS method with the GMM-based AVSS and JADE-AV methods (defined in Section 3.3.1). Finally, experiments on convolutive 2×2 mixtures with filters of different lengths are presented to compare the performance of the audio-only and the proposed hybrid methods.
5.1 Audiovisual models assessment and selection
In this experiment, we pre-evaluate the fitness of the AV models and explore the effect of different parameters on their performance. Both MLP- and GMM-based AV models need training and have hyperparameters to be selected. We should choose proper dimensions k_{ a } and k_{ v } of the acoustic and visual parameters, the embedding context size K, the number of hidden neurons N_{ H } of the MLP and the number of Gaussian components N_{ M } of the GMM models. Although validation scores of the trained models can be used to select the best GMM and MLP models, selecting models based on their capability to discriminate between coherent and incoherent speech is more reasonable, since the models are intended for source separation. Furthermore, such an experiment provides insight into the potential of coherency-based AVSS methods.
5.1.1 Audiovisual pure relevant source detection
In this experiment, we compare incoherency scores between a visual stream V^{1} and two pure audio signals: a coherent signal s^{1} and an irrelevant signal s^{2}. For each frame in the test set, the signal which produces the minimum incoherency score is recognized to be coherent with V^{1}. Experiments are performed for both AV models $\mathcal{M}_{GMM}(\mathcal{S},\mathcal{V})$ (5) and $\mathcal{M}_{MLP}(\mathcal{S},\mathcal{V})$ (7). For each model, different values of the hyperparameters k_{ a } ∈ {2,4,6,8,10,12}, k_{ v }∈{2,4,6,8}, N_{ M },N_{ H }∈{4,8,12,16,20,24,28}, K∈{1,2,3,4,5,6} are examined. Finally, the percentage of frames for which signal s^{1} is correctly selected is reported as the classification accuracy for different values of batch size T.
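The detection experiment can be mimicked with the following Python sketch, which turns per-frame incoherency scores into a classification accuracy for a given batch size T. Pooling by summing scores over non-overlapping batches of T frames is an assumption of this sketch (the exact batching is not spelled out here), the function name is hypothetical, and the scores are synthetic Gaussian numbers rather than outputs of a trained AV model.

```python
import numpy as np

def detection_accuracy(score_relevant, score_irrelevant, T):
    """Percentage of batches in which the relevant signal is selected.

    Inputs are per-frame incoherency scores of s1 and s2 against V1
    (lower means more coherent). Scores are summed over non-overlapping
    batches of T frames, which is an assumed pooling rule.
    """
    n_batches = len(score_relevant) // T
    rel = np.asarray(score_relevant[: n_batches * T]).reshape(n_batches, T).sum(axis=1)
    irr = np.asarray(score_irrelevant[: n_batches * T]).reshape(n_batches, T).sum(axis=1)
    return 100.0 * np.mean(rel < irr)

rng = np.random.default_rng(0)
rel = rng.normal(1.0, 1.0, size=16_000)    # synthetic scores of the coherent signal
irr = rng.normal(1.5, 1.0, size=16_000)    # synthetic scores of the irrelevant signal
acc = {T: detection_accuracy(rel, irr, T) for T in (1, 4, 16)}
print(acc)
```

Even with heavily overlapping per-frame score distributions, pooling over more frames makes the decision more reliable.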
In [14], authors have evaluated the classification rate just against a single irrelevant signal which is uttered by a different male speaker. Our initial experiments revealed that classification accuracy for different irrelevant signals is variable depending on the speaker, the speech content of signal and alignment of silent parts of coherent and incoherent signals. Thus, to provide classification rates with high confidence, we conducted multiple simulations by performing coherency classification on the relevant signal s^{1}y against six distinct speech signals for s^{2} and reported the average recognition rate as the performance of models.
Furthermore, it is possible that AV models, in addition to AV coherency of speech, capture some parts of AV identity of speaker. To check for this, we chose coherent and irrelevant speech signals both from the same speaker. As much as AV models have captured speaker identity, this provides a classification problem which is more confusing and complex for them relative to choosing irrelevant signals from different speaker(s).
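Optimal embedding context size (K) and model sizes (N_{M} of GMM and N_{H} of MLP). GMM entries are K/N_{M}; MLP entries are K/N_{H}.

| k_{a} | GMM, k_{v}=2 | GMM, k_{v}=4 | GMM, k_{v}=6 | GMM, k_{v}=8 | MLP, k_{v}=2 | MLP, k_{v}=4 | MLP, k_{v}=6 | MLP, k_{v}=8 |
|---|---|---|---|---|---|---|---|---|
| 2 | 1/4 | 1/16 | 1/28 | 1/8 | 4/20 | 4/16 | 4/20 | 6/28 |
| 4 | 1/24 | 1/24 | 1/24 | 1/20 | 4/20 | 3/16 | 2/24 | 3/16 |
| 6 | 1/28 | 1/20 | 1/24 | 1/24 | 4/8 | 2/24 | 3/16 | 2/8 |
| 8 | 1/8 | 1/8 | 1/28 | 1/28 | 2/20 | 2/16 | 2/28 | 2/16 |
| 10 | 1/24 | 1/28 | 1/28 | 1/24 | 2/16 | 2/28 | 2/16 | 2/16 |
| 12 | 1/8 | 1/16 | 1/20 | 1/20 | 4/4 | 2/16 | 2/20 | 4/8 |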
1. The accuracy of both the MLP and GMM models improves as the number of batch frames T, acoustic features k_{a} and visual features k_{v} increases.
2. Among these factors, the batch size T has the highest impact, followed by k_{a}; k_{v} has the lowest impact.
1. The MLP model performs significantly better than the GMM across the various AV dimensions.
2. In lower visual dimensions (k_{v}=2), the MLP outperforms the GMM by a margin of 10−15%. In this case, even the MLP under its worst condition (batch size T=4) is 5−6% more accurate than the GMM under its best condition (batch size T=16).
3. In higher visual dimensions (k_{v}=6,8), the difference between the GMM and the MLP is somewhat reduced.
4. The improvement from increasing the number of features k_{a} and k_{v} is bounded. No further significant gain is achieved for k_{v}>8 in the MLP, k_{v}>6 in the GMM and k_{a}>8 in both models. The model complexity grows as O(K·k_{a}+k_{v}) for the MLP and O((K·k_{a}+k_{v})^{2}) for the GMM, and at some point this results in models that are overly complex for the problem (considering the amount of available training data).
5. In contrast, the improvement from increasing the batch size T continues upward and may reach perfect accuracy for sufficiently large T. This is because T does not change the model size, while increasing it brings more information into the decision. However, in real AVSS tasks T cannot be increased arbitrarily, since doing so invalidates the stationarity assumption of the mixture model (2). Hence, the value of T involves a tradeoff between the accuracy of the AV model and the fitness of the mixing model.
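The linear versus quadratic growth in model size can be made concrete with a rough parameter count. The sketch below assumes a single-hidden-layer MLP that maps the K·k_{a} embedded acoustic inputs to k_{v} visual outputs, and a full-covariance GMM over the joint (K·k_{a}+k_{v})-dimensional vector; both architectural details and the function names are assumptions made for illustration and may differ from the models actually trained.

```python
def mlp_param_count(K, k_a, k_v, n_hidden):
    """Weights and biases of a one-hidden-layer MLP: K*k_a inputs -> k_v outputs."""
    d_in = K * k_a
    return (d_in + 1) * n_hidden + (n_hidden + 1) * k_v


def gmm_param_count(K, k_a, k_v, n_components):
    """Free parameters of a full-covariance GMM over the joint AV vector."""
    d = K * k_a + k_v
    per_component = d + d * (d + 1) // 2  # mean vector + symmetric covariance
    return n_components * per_component + (n_components - 1)  # + mixture weights


# Same latent size (16), context doubled from K=2 to K=4 (illustrative values).
mlp_small = mlp_param_count(K=2, k_a=8, k_v=6, n_hidden=16)
mlp_large = mlp_param_count(K=4, k_a=8, k_v=6, n_hidden=16)
gmm_small = gmm_param_count(K=2, k_a=8, k_v=6, n_components=16)
gmm_large = gmm_param_count(K=4, k_a=8, k_v=6, n_components=16)
```

Under these assumptions, doubling the context adds a few hundred parameters to the MLP but several thousand to the GMM, which is consistent with the GMM gaining nothing from wider contexts on a limited training set.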
1. Across the various values of k_{a} and k_{v}, the GMM always performed best with K=1 frame in the embedded context, which means that the GMM cannot capture temporal dynamics through frame embedding because its number of parameters grows quadratically.
2. The MLP always performed best with K=2 frames (for larger k_{a}) or K=4,6 frames (for smaller k_{a}) in the embedded context, showing that it can capture some temporal dynamics.
3. Both the GMM and the MLP models use the largest average number of latent units at k_{v}=6, which appears to be the most efficient visual dimension according to the results of Figure 3.
5.1.2 Audiovisual mixed relevant source detection
Recall that the classification results in Section 5.1.1 are based on comparing the incoherency scores of pure relevant and irrelevant speech signals. This shows that the AV models are well suited to selecting a clean relevant source among multiple available irrelevant signals. For example, they will perform well for relevant source selection in the AV-assisted ICA-based source separation method (i.e. JADEAV) discussed in Section 3.3.1.
In general, the trends in Figure 4 show the same properties as those discussed for Figure 3. The main point is that the classification accuracy of the best models on the mixed signals ξ_{ i }, ξ_{i+1} is about 10% lower than on pure relevant and irrelevant signals. Such a degradation is expected, since the signals ξ_{ i } and ξ_{i+1} are very similar. Interestingly, however, the models that were superior in the pure classification have approximately kept their superiority in the mixed case. This suggests that the model configurations that are best for the classification task may remain best for the separation task. As before, the MLP models are superior to the GMM models, but the large gap between them is somewhat reduced.
5.2 Source separation experiments
5.2.1 Separation performance criterion
The output SIR criterion is widely used to evaluate the performance of source separation algorithms when the original source signals or the mixing systems are available [39]. Since we perform batch-wise separation in our experiments, the output SIR is averaged over all batches in the test set. It is worth mentioning that in convolutive mixtures, the SIRs must be calculated up to an allowed arbitrary filtering of the sources. This can be accomplished using the decomposition method of Vincent et al. [39].
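As a rough illustration of batch-wise SIR averaging, the sketch below assumes that each separated output has already been decomposed into a target component and an interference component (for instance by the decomposition of [39]). The batch length T is counted in samples here for simplicity, and the per-batch SIRs are averaged in dB; both choices, as well as the function names, are assumptions rather than details stated in the text.

```python
import math


def sir_db(target, interference):
    """SIR in dB: energy of the target component over energy of the interference."""
    p_t = sum(x * x for x in target)
    p_i = sum(x * x for x in interference)
    if p_i == 0.0:
        return math.inf  # no interference left in this batch
    if p_t == 0.0:
        return -math.inf  # target absent from this batch
    return 10.0 * math.log10(p_t / p_i)


def mean_batch_sir_db(target, interference, T):
    """Average the per-batch SIR (in dB) over consecutive batches of T samples."""
    n_batches = len(target) // T
    sirs = [sir_db(target[b * T:(b + 1) * T], interference[b * T:(b + 1) * T])
            for b in range(n_batches)]
    return sum(sirs) / len(sirs)


# Toy components: 20 dB in the first batch, 40 dB in the second.
target = [1.0] * 8
interference = [0.1] * 4 + [0.01] * 4
average_sir = mean_batch_sir_db(target, interference, T=4)
```

Averaging in dB weights every batch equally regardless of its energy; averaging the linear power ratios instead would let a few well-separated batches dominate the figure.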
5.2.2 Separation in regular N×N mixtures
In this experiment, we consider regular N×N mixtures with an equal number of sources and sensors. Simulations are performed for mixtures of sizes N=2,3,5, and the separation performance is presented in terms of the output SIR (21). Experiments are conducted on the test sets of both the alphadigits (Persian and English) and poetverses (Persian) corpora (see Section 4.1 for corpus details). Each corpus consists of a pair of synchronous audio and visual streams of frames. From each corpus, 3,000 frames are used in the separation simulations. The audio stream of the test corpus is taken as the relevant source s^{1}, and speech signals of the same length are used for the other N−1 sources. These speech signals are selected from a supplementary corpus recorded from other speakers at the same sampling frequency.
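Average input (mixed) SIRs in dB for both corpora and for different M×N mixing matrices

| Channel | Alpha digits, 2×2 | Alpha digits, 2×3 | Alpha digits, 3×3 | Alpha digits, 5×5 | Poet verses, 2×2 | Poet verses, 2×3 | Poet verses, 3×3 | Poet verses, 5×5 |
|---|---|---|---|---|---|---|---|---|
| Ch. 1 | 0.6 | −2.5 | −3.2 | −9.0 | −0.7 | −3.9 | −4.5 | −10.2 |
| Ch. 2 | −1.8 | −4.8 | −5.9 | −8.6 | −3.2 | −6.1 | −7.3 | −9.9 |
| Ch. 3 |  |  | 0.4 | −8.8 |  |  | −0.9 | −10.1 |
| Ch. 4 |  |  |  | −8.2 |  |  |  | −9.4 |
| Ch. 5 |  |  |  | −6.7 |  |  |  | −8.0 |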
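Input SIRs of the kind tabulated above follow directly from the mixing matrix and the source powers. The sketch below is a minimal illustration for a linear instantaneous mixture x = A s, assuming mutually uncorrelated sources so that cross terms vanish; the matrix, the powers and the function names are made up for the example and are not the values used in the experiments.

```python
import math


def mix(A, sources):
    """Instantaneous mixing x = A s; A is M x N, sources holds N equal-length signals."""
    n_samples = len(sources[0])
    return [[sum(a * s[t] for a, s in zip(row, sources)) for t in range(n_samples)]
            for row in A]


def input_sir_db(A, source_powers, channel, relevant=0):
    """Input SIR of one sensor: relevant-source power over summed interferer power."""
    row = A[channel]
    p_rel = row[relevant] ** 2 * source_powers[relevant]
    p_int = sum(a * a * p
                for j, (a, p) in enumerate(zip(row, source_powers))
                if j != relevant)
    return 10.0 * math.log10(p_rel / p_int)


# Hypothetical 2x2 mixing matrix and unit-power sources.
A = [[1.0, 1.0],
     [0.5, 1.0]]
mixed = mix(A, [[1.0, 2.0], [3.0, 4.0]])
sir_ch1 = input_sir_db(A, [1.0, 1.0], channel=0)
sir_ch2 = input_sir_db(A, [1.0, 1.0], channel=1)
```

With more interferers of comparable power, the denominator grows and the input SIR drops, which matches the trend from the 2×2 to the 5×5 columns.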
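Average output SIRs in dB for the 2×2 case

| T | Alpha digits, JADEAV | Alpha digits, GMMAVSS | Alpha digits, MLPAVSS | Poet verses, JADEAV | Poet verses, GMMAVSS | Poet verses, MLPAVSS |
|---|---|---|---|---|---|---|
| 4 | 18.3 | 24.0 | 27.7 | 7.1 | 11.9 | 13.3 |
| 8 | 28.7 | 29.5 | 35.1 | 22.9 | 19.4 | 21.4 |
| 16 | 32.9 | 35.9 | 39.9 | 30.8 | 30.2 | 32.1 |
| 32 | 37.1 | 39.6 | 43.8 | 36.3 | 36.5 | 38.5 |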
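Average output SIRs in dB for the 3×3 case

| T | Alpha digits, JADEAV | Alpha digits, GMMAVSS | Alpha digits, MLPAVSS | Poet verses, JADEAV | Poet verses, GMMAVSS | Poet verses, MLPAVSS |
|---|---|---|---|---|---|---|
| 8 | 18.2 | 19.9 | 23.8 | 11.0 | 9.0 | 10.1 |
| 16 | 22.8 | 25.3 | 29.1 | 18.8 | 18.2 | 19.8 |
| 32 | 27.1 | 28.7 | 31.6 | 25.7 | 25.4 | 26.7 |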
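Average output SIRs in dB for the 5×5 case

| T | Alpha digits, JADEAV | Alpha digits, GMMAVSS | Alpha digits, MLPAVSS | Poet verses, JADEAV | Poet verses, GMMAVSS | Poet verses, MLPAVSS |
|---|---|---|---|---|---|---|
| 8 | 4.9 | 11.9 | 13.8 | −1.2 | 0.4 | 2.2 |
| 16 | 16.1 | 21.9 | 24.3 | 10.1 | 10.3 | 11.7 |
| 32 | 20.3 | 26.2 | 29.4 | 17.0 | 17.2 | 19.1 |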
Effect of discrete and continuous speech: The performance of all methods is higher on the alphadigits corpus than on the poetverses corpus. The alphadigits corpus is discrete, whereas the poetverses corpus is continuous. Continuous speech is more complex for AV modeling, since lip formations are less clearly expressed at higher speech rates (coarticulation) and since the continuous corpus contains many more distinct words and phonetic contexts, which increases the phonetic complexity.
Relative separation performance of methods: For smaller mixture sizes (N=2,3) on the alphadigits corpus, the MLPAVSS method provides higher output SIRs than GMMAVSS, and both are superior to JADEAV. For N=3,5 and for the poetverses corpus, the performance gap between the AVSS methods and JADEAV is reduced. In this case, the performance gain of GMMAVSS is marginal, and GMMAVSS is sometimes worse than JADEAV. The superiority of MLPAVSS over GMMAVSS is consistent with the classification accuracies of the MLP and GMM-based AV models presented in Section 5.1.
Effect of batch integration time (T): The performance of all methods improves as the number of frames in each batch increases. Increasing the integration time enhances the accuracy of the contrast functions (6) and (8) (see Section 5.1) and also reduces spurious local minima in the optimization landscape. For the JADEAV algorithm, in addition to the improved accuracy of the AV contrast, a longer integration time allows better estimates of the higher-order statistics of the signals, which affect the separation quality of the JADE algorithm. Recall, however, that in real applications with nonstationary mixtures, increasing the number of frames in each batch involves a tradeoff (see Section 5.1).
5.2.3 Separation in degenerate M×N,M<N mixtures
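Average output SIRs in dB for the 2×3 case

| T | Alpha digits, JADEAV | Alpha digits, GMMAVSS | Alpha digits, MLPAVSS | Poet verses, JADEAV | Poet verses, GMMAVSS | Poet verses, MLPAVSS |
|---|---|---|---|---|---|---|
| 16 | 3.9 | 4.7 | 5.9 | −0.5 | −0.8 | 1.1 |
| 32 | 4.1 | 5.2 | 7.4 | −0.6 | −0.7 | 2.9 |
| 64 | 4.5 | 6.5 | 8.9 | 1.6 | 2.8 | 6.7 |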
In this case, the mixing matrices are degenerate and have no exact inverse. Hence, perfect recovery of the sources is not possible, and the SIRs are worse than in the regular N×N simulations. The AVSS methods show slight improvements over the JADEAV method. The performance of MLPAVSS is again superior to that of GMMAVSS, as expected. In this experiment, the results were highly dependent on the mixing matrix. In some mixtures, output SIRs close to 10 dB were achieved, while in others negative output SIRs were observed.
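Why perfect recovery fails in the 2×3 case can be seen from the global gains g = w^T A of a demixing vector w. With two sensors, w has only two degrees of freedom, so it can be made orthogonal to one interferer column of A but not, in general, to two. The sketch below uses a hypothetical mixing matrix and hypothetical helper names purely for illustration.

```python
def global_gains(w, A):
    """Gain of each source in the output y = w^T x, i.e. g = w^T A."""
    n_sources = len(A[0])
    return [sum(w[i] * A[i][j] for i in range(len(w))) for j in range(n_sources)]


def null_one_interferer(A, j):
    """Demixing vector orthogonal to column j of a 2 x N mixing matrix."""
    return [A[1][j], -A[0][j]]


# Hypothetical 2x3 mixing matrix: column 0 is the relevant source.
A = [[1.0, 0.5, 0.2],
     [0.3, 1.0, 0.8]]
w = null_one_interferer(A, 1)   # cancel the second source exactly
gains = global_gains(w, A)      # the third source still leaks into the output
```

Here the second source is removed but the third keeps a nonzero gain, so the achievable SIR is bounded and depends strongly on how the interferer columns happen to be oriented, in line with the matrix-dependent results reported above.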
5.2.4 Separation of convolutive 2×2 mixtures
Average output SIRs in dB for the 2×2 convolutive case

| Method | L=0 | L=2 | L=5 | L=10 | L=15 | L=20 | L=25 |
|---|---|---|---|---|---|---|---|
| kurt_{n} [31] | 36.2 | 10.1 | 9.6 | 9.7 | 8.9 | 8.5 | 8.6 |
| J_{avICA} (14) | 43.5 | 11.7 | 10.9 | 11.5 | 10.9 | 10.7 | 10.8 |
For L=0, the mixture is instantaneous and separation is possible with high SIR. For L=2 (filters with five taps) and for higher orders of the mixing and demixing filters, however, the SIR decreases to about 9 dB on average for the audio-only method of [31] and to about 11 dB for the hybrid AV-coherency and independence method proposed in Section 3.4.
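The relation between the filter order and the mixing model can be sketched as follows. The example assumes causal FIR mixing filters and takes the tap count to be 2L+1, which is inferred from the remark that L=2 corresponds to five taps; the filter coefficients and function names are illustrative and not those used in the experiments.

```python
def fir_filter(signal, taps):
    """Causal FIR filtering, truncated to the input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def convolutive_mix(filters, sources):
    """x_i(n) = sum_j (h_ij * s_j)(n) for an M x N matrix of FIR filters."""
    n_samples = len(sources[0])
    mixed = []
    for row in filters:
        x = [0.0] * n_samples
        for taps, s in zip(row, sources):
            x = [a + b for a, b in zip(x, fir_filter(s, taps))]
        mixed.append(x)
    return mixed


# L=0: every filter has a single tap, so the mixture is instantaneous.
instantaneous = convolutive_mix([[[1.0], [0.5]], [[0.3], [1.0]]],
                                [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# L=1: three taps; an impulse reveals the filter coefficients.
impulse_response = fir_filter([1.0, 0.0, 0.0, 0.0], [1.0, 0.5, 0.25])
```

As soon as the filters have more than one tap, each sensor receives delayed and scaled copies of every source, so the demixing system must itself be a set of filters rather than a single vector, which makes the problem considerably harder than the L=0 case.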
6 Conclusion
In this paper, we proposed an improved AV association model using an MLP which exploits the dependency between AV frames and is superior to the existing GMM AV model. The MLP model makes more efficient use of its parameters than the GMM model. Hence, unlike the GMM model, it can capture temporal dynamics from a limited context of frames around the current frame to enhance the coherency measure. We also proposed a hybrid criterion which exploits AV coherency together with normalized kurtosis as an independence measure and, based on that, moved toward a time-domain convolutive AVSS method. Experimental results comparing the methods are presented in terms of the relevant signal classification accuracy and the separation output SIRs. The results confirm the contribution of the proposed neural AV association model to the enhancement of AV incoherency scores and hence to the improvement of the separation SIRs compared to the existing GMM-based AVSS algorithm and the visually assisted ICA (JADEAV) method. The results of the time-domain convolutive method using the hybrid AV criterion also show an improvement over the reference audio-only method.
For the visual parametrization, we used normalized PCA-projected (whitened) lip appearance features. PCA features do not need exact lip contour detection and hence require less computation than the extraction of lip geometric (width and height) parameters. However, they have the drawback of being more sensitive to the speaker and to the segmentation of the lip region. The fitness of PCA features for AV modeling and the AVSS task is supported by qualitative illustrations and numerical results. Nevertheless, the proposed AVSS method is not tied to the PCA visual features and can be used with more robust and accurate visual features.
Although the proposed model improves the quality of AV modeling, further enhancements are both required and foreseeable to make these methods applicable to more complex phonetic contexts and speaker-independent situations. The AV relation is both nonlinear and stochastic. The GMM benefits from its capability for probabilistic modeling, but it fails to handle the nonlinearity and the temporal dependency efficiently. The MLP, on the other hand, seems to benefit from its relatively deep structure and efficient use of its parameters, but it does not truly account for the stochastic nature of the AV relation. Further improvements may be gained by introducing a model that can efficiently handle both the nonlinear and the stochastic relations of the two modalities as well as the temporal dependency. It also seems promising to consider tighter combinations of ICA and AV coherency-based methods to jointly gain the benefits of both informed and blind methods.
Finally, it is worth mentioning that in this paper we did not consider the inter-batch temporal dynamics of the demixing vectors and separated signals. This temporal information could be exploited, for example with a Bayesian recursive filtering approach, to improve the performance and speed of the proposed methods. It is also possible to determine the working batch size adaptively, based on the amount of inter-batch variation of the demixing vectors.
Appendix
A where Re{.} and Im{.} are the real and imaginary part operators. Equations 22, 23 and 24 are derived for the calculation of the Jacobian of a single frame. In our MATLAB implementation, we have derived more complex matrix forms which allow the Jacobian of multiple acoustic frames (i.e. all frames in a batch) to be calculated using efficient vectorized computing. Having the Jacobians of the individual acoustic frames $\mathcal{Y}\left(k\right)$, we combine them to obtain the Jacobian of the embedded acoustic vectors ${\mathcal{Y}}_{e}\left(k\right)$. This is done according to the definition of the embedding method E.
Declarations
Acknowledgements
The authors would like to thank the anonymous reviewers for their attention, which resolved some important presentation issues, and for their constructive suggestions.
References
1. Sumby WH, Pollack I: Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am 1954, 26:212-215. 10.1121/1.1907309
2. Summerfield Q: Lipreading and audiovisual speech perception. Philos. Trans. R. Soc. Lond. B Biol. Sci 1992, 335(1273):71-78. 10.1098/rstb.1992.0009
3. Grant KW, Seitz PF: The use of visible speech cues for improving auditory detection of spoken sentences. J. Acoust. Soc. Am 2000, 108:1197. 10.1121/1.1288668
4. McGurk H, MacDonald J: Hearing lips and seeing voices. Nature 1976, 264(5588):746-748. 10.1038/264746a0
5. Schwartz JL, Berthommier F, Savariaux C: Seeing to hear better: evidence for early audiovisual interactions in speech identification. Cognition 2004, 93(2):69-78. 10.1016/j.cognition.2004.01.006
6. Petajan ED: Automatic lipreading to enhance speech recognition. PhD thesis, University of Illinois, Illinois; 1984.
7. Girin L, Schwartz JL, Feng G: Audiovisual enhancement of speech in noise. J. Acoust. Soc. Am 2001, 109(6):3007-3020. 10.1121/1.1358887
8. Deligne S, Potamianos G, Neti C: Audiovisual speech enhancement with AVCDCN (AudioVisual Codebook Dependent Cepstral Normalization). In Proceedings of the ISCA International Conference on Spoken Language Processing (ICSLP’02). ISCA; 2002:1449-1452.
9. Berthommier F: Audiovisual speech enhancement based on the association between speech envelope and video features. In Proceedings of the ISCA European Conference on Speech Communication and Technology (EUROSPEECH’03). ISCA; 2002:1045-1048.
10. Goecke R, Potamianos G, Neti C: Noisy audio feature enhancement using audiovisual speech data. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’02). IEEE; 2002:2025-2028.
11. Rivet B, Girin L, Jutten C: Visual voice activity detection as a help for speech source separation from convolutive mixtures. Speech Communication 2007, 49(7):667-677.
12. Sodoyer D, Schwartz JL, Girin L, Klinkisch J, Jutten C: Separation of audiovisual speech sources: a new approach exploiting the audiovisual coherence of speech stimuli. EURASIP J. Appl. Signal Process 2002, 2002(1):1164-1173.
13. Rajaram S, Nefian AV, Huang TS: Bayesian separation of audiovisual speech sources. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04). IEEE; 2004:657-661.
14. Sodoyer D, Girin L, Jutten C, Schwartz JL: Developing an audiovisual speech source separation algorithm. Speech Commun 2004, 44(1):113-125.
15. Wang W, Cosker D, Hicks Y, Sanei S, Chambers J: Video assisted speech source separation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05). IEEE; 2005:425425.
16. Rivet B, Girin L, Jutten C: Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. Audio, Speech, Lang. Process. IEEE Trans 2007, 15(1):96-108.
17. Sigg C, Fischer B, Ommer B, Roth V, Buhmann J: Nonnegative CCA for audiovisual source separation. In IEEE Workshop On Machine Learning for Signal Processing (MLSP’07). IEEE; 2007:253-258.
18. Casanovas AL, Monaci G, Vandergheynst P, Gribonval R: Blind audiovisual separation based on redundant representations. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’08). IEEE; 2008:1841-1844.
19. Monaci G, Sommer F, Vandergheynst P: Learning sparse generative models of audiovisual signals. In EURASIP European Signal Processing Conference (EUSIPCO ’08). EURASIP; 2008.
20. Monaci G, Vandergheynst P, Sommer FT: Learning bimodal structure in audio–visual data. Neural Netw. IEEE Trans 2009, 20(12):1898-1910.
21. Casanovas AL, Monaci G, Vandergheynst P, Gribonval R: Blind audiovisual source separation based on sparse redundant representations. Multimedia, IEEE Trans 2010, 12(5):358-371.
22. Liang Y, Naqvi SM, Chambers JA: Audio video based fast fixed-point independent vector analysis for multisource separation in a room environment. EURASIP J. Adv. Signal Process 2012, 2012(1):1-16. 10.1186/1687618020121
23. Liu Q, Wang W, Jackson PJ, Barnard M, Kittler J, Chambers JA: Source separation of convolutive and noisy mixtures using audiovisual dictionary learning and probabilistic time-frequency masking. IEEE Trans. Signal Process 2013, 61(22).
24. Khan MS, Naqvi SM, Rehman A, Wang W, Chambers JA: Video-aided model-based source separation in real reverberant rooms. IEEE Trans. Audio Speech Lang. Process 2013, 21(9):1900-1912.
25. Bell AJ, Sejnowski TJ: An information-maximization approach to blind separation and blind deconvolution. Neural Computat 1995, 7(6):1129-1159. 10.1162/neco.1995.7.6.1129
26. Hyvärinen A, Oja E: A fast fixed-point algorithm for independent component analysis. Neural Comput 1997, 9(7):1483-1492. 10.1162/neco.1997.9.7.1483
27. Cardoso JF: High-order contrasts for independent component analysis. Neural Comput 1999, 11(1):157-192. 10.1162/089976699300016863
28. Jeffers J, Barley M: Speechreading (lipreading). Charles C. Thomas Publisher, Springfield, Illinois; 1971.
29. Cappelletta L, Harte N: Phoneme-to-viseme mapping for visual speech recognition. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM 2012). IEEE; 2012:322-329.
30. Delfosse N, Loubaton P: Adaptive blind separation of independent sources: a deflation approach. Signal Process 1995, 45(1):59-83. 10.1016/01651684(95)00042C
31. Tugnait JK: Identification and deconvolution of multichannel linear non-Gaussian processes using higher order statistics and inverse filter criteria. Signal Process. IEEE Trans 1997, 45(3):658-672. 10.1109/78.558482
32. Zarzoso V, Comon P: Robust independent component analysis by iterative maximization of the kurtosis contrast with algebraic optimal step size. Neural Netw. IEEE Trans 2010, 21(2):248-261.
33. Gazor S, Zhang W: Speech probability distribution. Signal Process. Lett. IEEE 2003, 10(7):204-207.
34. Tashev I, Acero A: Statistical modeling of the speech signal. In Proc. Intl. Workshop on Acoustic, Echo, and Noise Control (IWAENC 2010). IEEE; 2010.
35. Thomas J, Deville Y, Hosseini S: Time-domain fast fixed-point algorithms for convolutive ICA. Signal Process. Lett. IEEE 2006, 13(4):228-231.
36. Moayedi F, Kazemi A, Azimifar Z: Hidden Markov model-unscented Kalman filter contour tracking: a multicue and multiresolution approach. In Iranian Conference on Machine Vision and Image Processing (MVIP 2010). IEEE, Piscataway; 2010:1-6.
37. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B (Methodological) 1977, 39(1):1-38.
38. Hagan MT, Menhaj M: Training feedforward networks with the Marquardt algorithm. IEEE Trans. Neural Netw 1994, 5(6):989-993. 10.1109/72.329697
39. Vincent E, Gribonval R, Févotte C: Performance measurement in blind audio source separation. Audio, Speech, Lang. Process. IEEE Trans 2006, 14(4):1462-1469.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.