### 2.1. From FDBSS to IVA

In real-world acoustic environments, signals are mixed together along with their delays, attenuations, and reverberations, i.e., the signals are convolutively mixed. Supposing there are *N* sources and *M* sensors (*M* ≥ *N*), the signal captured by sensor *m* can be modeled as (1) [14], where ★ denotes convolution and $a_{mn}(t)$ is the finite-duration impulse response of the mixing filter from source *n* to sensor *m*.

$$x_m(t) = \sum_{n=1}^{N} a_{mn}(t) \star s_n(t)$$

(1)

When the STFT is used, if the STFT frame length is sufficiently longer than the mixing filter length [14], the time-domain convolution in (1) can be approximately converted into the frequency-domain multiplication in (2), where $s_n^{[f]}(\tau)$, $x_m^{[f]}(\tau)$, and $a_{mn}^{[f]}$ are the frequency-domain versions of $s_n(t)$, $x_m(t)$, and $a_{mn}(t)$, respectively. Stacking all sources $\mathbf{s}^{[f]}(\tau) = [s_1^{[f]}(\tau), \dots, s_N^{[f]}(\tau)]^{\mathrm{T}}$ and sensors $\mathbf{x}^{[f]}(\tau) = [x_1^{[f]}(\tau), \dots, x_M^{[f]}(\tau)]^{\mathrm{T}}$, the complete mixing process can be formulated as (3), where $\mathbf{A}^{[f]}$ is the mixing matrix for frequency bin *f*, with $a_{mn}^{[f]}$ as its entries.

$$x_m^{[f]}(\tau) = \sum_{n=1}^{N} a_{mn}^{[f]}\, s_n^{[f]}(\tau)$$

(2)

$$\mathbf{x}^{[f]}(\tau) = \mathbf{A}^{[f]} \mathbf{s}^{[f]}(\tau)$$

(3)

$$\mathbf{y}^{[f]}(\tau) = \mathbf{W}^{[f]} \mathbf{x}^{[f]}(\tau)$$

(4)

Since the signals are instantaneously mixed within each frequency bin, complex-valued ICA algorithms such as [5, 6] can be used to separate them, as depicted in (4), where $\mathbf{W}^{[f]}$ is the demixing matrix for frequency bin *f*, estimated by ICA. FDBSS uses (4) to separate the signals; an example of a 2 × 2 FDBSS demixing model is shown in Figure 1a. In this example, each horizontal layer is one ICA demixing model (4) for one frequency bin, and the demixing is carried out in each layer independently. Since the ICA runs in different layers may output the separated results in different orders, a permutation ambiguity arises in FDBSS, indicated by the different colors of $\mathbf{y}^{[f]}$ in Figure 1a. This permutation ambiguity must be carefully resolved by algorithms such as [7–12] before the inverse STFT is performed, or else the separation will fail.
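The narrowband approximation behind (2) can be illustrated with the DFT. For circular convolution the identity is exact (this is the convolution theorem); for the linear convolution of real recordings it only holds approximately when the frame is much longer than the filter, as stated above. A NumPy sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
L, K = 256, 8                      # frame length >> filter length

s = rng.standard_normal(L)         # one source frame s_n(t)
a = rng.standard_normal(K)         # short mixing filter a_mn(t)

# Circular convolution in the time domain: x(t) = sum_k a(k) s((t - k) mod L).
x_time = np.zeros(L)
for k in range(K):
    x_time += a[k] * np.roll(s, k)

# Per-bin multiplication in the frequency domain: x^[f] = a^[f] s^[f].
x_freq = np.fft.fft(a, L) * np.fft.fft(s)

# The two views coincide bin by bin.
assert np.allclose(np.fft.fft(x_time), x_freq)
```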

In addition to separating the sources in each frequency bin, IVA exploits inter-frequency-bin information to solve the permutation problem within the separation procedure itself. The IVA model is very similar to the FDBSS model, as shown in Figure 1b. The difference is that in IVA the signals are treated as vectors, i.e., $\mathbf{x}_m = [x_m^{[1]}, \dots, x_m^{[F]}]^{\mathrm{T}}$ and $\mathbf{y}_n = [y_n^{[1]}, \dots, y_n^{[F]}]^{\mathrm{T}}$ (the vertical bars in Figure 1b), and they are optimized as multivariate variables rather than as independent scalars as in ICA. The IVA model can also be formulated in a single equation: after the data in all layers are concatenated into vectors $\mathbf{x} = [\mathbf{x}^{[1]}; \dots; \mathbf{x}^{[F]}]$ and $\mathbf{y} = [\mathbf{y}^{[1]}; \dots; \mathbf{y}^{[F]}]$, and $\mathbf{W}$ is taken as the block-diagonal matrix with each $\mathbf{W}^{[f]}$ on its diagonal, the demixing can be written as $\mathbf{y} = \mathbf{W}\mathbf{x}$, the same expression as in ICA.

### 2.2. IVA objective function

Mutual information I(·) is a natural measure of independence: it reaches its minimum of zero exactly when the random variables are mutually independent, and it is often employed as the objective function in ICA. Mutual information can be written as the KL divergence KL(·∥·) in (5), where $p_{\mathbf{y}}$ denotes the probability density function (PDF) of a random vector $\mathbf{y}$, $p_{y_n}$ denotes its *n*-th marginal PDF, and $\mathbf{z}$ is a dummy variable of integration [16].

$$\mathrm{I}(\mathbf{y}) = \mathrm{KL}\Big(p_{\mathbf{y}} \,\Big\|\, \prod_n p_{y_n}\Big) = \int p_{\mathbf{y}}(\mathbf{z}) \log \frac{p_{\mathbf{y}}(\mathbf{z})}{\prod_n p_{y_n}(z_n)}\, d\mathbf{z}$$

(5)

The IVA objective function has a form similar to (5); however, each $\mathbf{y}_n$ in IVA is a vector rather than a scalar. The IVA objective function and the corresponding derivation are given in (6) [16, 17], where H(·) denotes entropy.

$$\begin{aligned}
\mathcal{J}_{\mathrm{IVA}} &= \mathrm{KL}\Big(p_{\mathbf{y}} \,\Big\|\, \prod\nolimits_n p_{\mathbf{y}_n}\Big)\\
&= \sum_{n=1}^{N} \mathrm{H}(\mathbf{y}_n) - \mathrm{H}(\mathbf{y}_1; \dots; \mathbf{y}_N)\\
&= \sum_{n=1}^{N} \mathrm{H}(\mathbf{y}_n) - \mathrm{H}(\mathbf{y}^{[1]}; \dots; \mathbf{y}^{[F]})\\
&= \sum_{n=1}^{N} \mathrm{H}(\mathbf{y}_n) - \mathrm{H}(\mathbf{W}^{[1]}\mathbf{x}^{[1]}; \dots; \mathbf{W}^{[F]}\mathbf{x}^{[F]})\\
&= \sum_{n=1}^{N} \mathrm{H}(\mathbf{y}_n) - \mathrm{H}(\mathbf{W}\mathbf{x})\\
&= \sum_{n=1}^{N} \mathrm{H}(\mathbf{y}_n) - \sum_{f=1}^{F} \log\left|\det\big(\mathbf{W}^{[f]}\big)\right| - C
\end{aligned}$$

(6)

In (6), the last equality follows because $\mathrm{H}(\mathbf{W}\mathbf{x}) = \log|\det(\mathbf{W})| + \mathrm{H}(\mathbf{x})$ holds for a linear invertible transformation $\mathbf{W}$, and the determinant of the block-diagonal matrix factorizes as $\det(\mathbf{W}) = \prod_{f=1}^{F} \det(\mathbf{W}^{[f]})$. The term $C = \mathrm{H}(\mathbf{x})$ is a constant because the observed signals do not change during the optimization [16, 17].

When the observed signals in each frequency bin are centered and whitened ($\mathbf{x} \leftarrow \mathbf{x} - \mathrm{E}(\mathbf{x})$ so that $\mathrm{E}(\mathbf{x}) = \mathbf{0}$, then $\mathbf{x} \leftarrow \mathbf{V}\mathbf{x}$ so that $\mathrm{E}(\mathbf{x}\mathbf{x}^{\mathrm{H}}) = \mathbf{I}$, where E(·) denotes expectation and $\mathbf{V}$ is the whitening matrix), the demixing matrices $\mathbf{W}^{[f]}$ become orthonormal, so the term $\sum_{f=1}^{F} \log|\det(\mathbf{W}^{[f]})|$ vanishes. Then, noting that $\mathrm{H}(\mathbf{y}_n) = \sum_f \mathrm{H}(y_n^{[f]}) - \mathrm{I}(\mathbf{y}_n)$, minimizing the IVA objective function in (6) is equivalent to minimizing (7) [17].

$$\mathcal{J}_{\mathrm{IVA}} = \sum_n \Big( \sum_f \mathrm{H}\big(y_n^{[f]}\big) - \mathrm{I}(\mathbf{y}_n) \Big)$$

(7)

From (7), we can see that its minimization balances minimizing the $\mathrm{H}(y_n^{[f]})$ terms against maximizing the $\mathrm{I}(\mathbf{y}_n)$ terms. According to basic ICA theory, independence is measured by non-Gaussianity, and minimizing $\mathrm{H}(y_n^{[f]})$ is equivalent to maximizing non-Gaussianity, which is responsible for separating the data in the individual frequency bins. Meanwhile, maximizing $\mathrm{I}(\mathbf{y}_n)$ enhances the dependency among the entries of $\mathbf{y}_n$, which is responsible for solving the permutation problem. In short, minimizing the IVA objective function simultaneously separates the data and solves the permutation problem [17].
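The centering and whitening preprocessing mentioned above can be sketched per frequency bin as follows (a minimal NumPy example; `x_f`, the M × T observation matrix of one bin, and its sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
M, T = 2, 5000

# Hypothetical complex observations for one frequency bin, shape (M, T),
# artificially correlated across channels.
x_f = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))
x_f = np.array([[1.0, 0.5], [0.2, 1.0]]) @ x_f

# Centering: x <- x - E(x).
x_f = x_f - x_f.mean(axis=1, keepdims=True)

# Whitening: x <- V x so that E(x x^H) = I, with V = D^{-1/2} P^H
# from the eigendecomposition of the sample covariance.
cov = (x_f @ x_f.conj().T) / x_f.shape[1]
d, P = np.linalg.eigh(cov)
V = np.diag(d ** -0.5) @ P.conj().T
z_f = V @ x_f

# The whitened covariance is (numerically) the identity.
cov_z = (z_f @ z_f.conj().T) / z_f.shape[1]
assert np.allclose(cov_z, np.eye(M))
```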

### 2.3. Optimization procedures

To minimize the objective function in (6), the entropy of the estimated source vectors must be calculated. Although the actual PDF of each $\mathbf{y}_n$ is unknown, a prior target PDF $\hat{p}(\mathbf{y}_n)$ is often used instead, so the objective function in (6) simplifies to (8) [14].

$$\mathcal{J}_{\mathrm{IVA}} = -\sum_n \mathrm{E}\left[\log \hat{p}(\mathbf{y}_n)\right]$$

(8)

Natural gradient descent and fast fixed-point iteration are two frequently used optimization methods in IVA. In the natural gradient-based approach [13, 14], differentiating the objective function with respect to the demixing matrices yields the updating rule in (9).

$$\mathbf{W}^{[f]} \leftarrow \mathbf{W}^{[f]} + \eta \left\{ \mathbf{I} - \mathrm{E}\left[ \varphi^{[f]}\big(\mathbf{y}^{[f]}\big) \big(\mathbf{y}^{[f]}\big)^{\mathrm{H}} \right] \right\} \mathbf{W}^{[f]}$$

(9)

In this equation, *η* is the learning rate, and $\varphi^{[f]}(\cdot)$ is a multivariate nonlinear function (also called the score function) for frequency bin *f*. This nonlinear function is closely tied to the chosen source prior PDF:

$$\varphi^{[f]}(\mathbf{y}_n) = -\frac{\partial \log \hat{p}\big(y_n^{[1]}, \dots, y_n^{[F]}\big)}{\partial y_n^{[f]}}$$

(10)
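One natural gradient iteration of (9) can be sketched as below. This is a sketch, not the paper's reference implementation: it assumes the widely used spherical Laplacian source prior, for which the score function reduces to $\varphi^{[f]}(\mathbf{y}_n) = y_n^{[f]} / \|\mathbf{y}_n\|$, and all sizes and variable names are hypothetical.

```python
import numpy as np

def natgrad_iva_step(W, X, eta=0.1, eps=1e-8):
    """One natural gradient IVA update in the style of eq. (9).

    W : (F, N, N) complex demixing matrices; X : (F, N, T) observations.
    Assumes the spherical Laplacian prior: phi^[f](y_n) = y_n^[f] / ||y_n||.
    """
    F, N, T = X.shape
    Y = np.einsum('fnm,fmt->fnt', W, X)                   # y^[f] = W^[f] x^[f]
    norms = np.sqrt((np.abs(Y) ** 2).sum(axis=0)) + eps   # ||y_n|| per (n, t)
    Phi = Y / norms                                       # multivariate score
    for f in range(F):
        G = np.eye(N) - (Phi[f] @ Y[f].conj().T) / T      # I - E[phi y^H]
        W[f] = W[f] + eta * G @ W[f]
    return W

# Smoke test on random data (hypothetical sizes).
rng = np.random.default_rng(2)
F, N, T = 3, 2, 200
W = np.stack([np.eye(N, dtype=complex)] * F)
X = rng.standard_normal((F, N, T)) + 1j * rng.standard_normal((F, N, T))
W = natgrad_iva_step(W, X)
assert W.shape == (F, N, N) and np.all(np.isfinite(W))
```

The expectation E[·] in (9) is replaced by a sample average over the T frames, which is the usual batch approximation.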

In [15, 16], the FIVA algorithm was proposed. Compared with the natural gradient-based approach, the convergence speed of FIVA is dramatically improved, and there is no need to choose a learning rate manually. After applying a nonlinear mapping G, the FIVA objective function is transformed from (8) into (11) [15, 16]. The corresponding updating rule is given in (12), followed by the symmetric decorrelation scheme in (13). In (12), $(\mathbf{w}_n^{[f]})^{\mathrm{H}}$ is the *n*-th row of the demixing matrix $\mathbf{W}^{[f]}$. In (13), the inverse square root of a symmetric matrix is $\mathbf{W}^{-1/2} = \mathbf{P}\mathbf{D}^{-1/2}\mathbf{P}^{\mathrm{H}}$, where $\mathbf{W} = \mathbf{P}\mathbf{D}\mathbf{P}^{\mathrm{H}}$ is the eigendecomposition of $\mathbf{W}$.

$$\mathcal{J}_{\mathrm{FIVA}} = \sum_{n=1}^{N} \mathrm{E}\left[\mathrm{G}\big(|\mathbf{y}_n|^2\big)\right] = \sum_{n=1}^{N} \mathrm{E}\left[\mathrm{G}\Big(\sum_{f=1}^{F} |y_n^{[f]}|^2\Big)\right]$$

(11)

$$\mathbf{w}_n^{[f]} \leftarrow \mathrm{E}\left[\mathrm{G}'\big(|\mathbf{y}_n|^2\big) + \big|y_n^{[f]}\big|^2\, \mathrm{G}''\big(|\mathbf{y}_n|^2\big)\right] \mathbf{w}_n^{[f]} - \mathrm{E}\left[\big(y_n^{[f]}\big)^{*}\, \mathrm{G}'\big(|\mathbf{y}_n|^2\big)\, \mathbf{x}^{[f]}\right]$$

(12)

$$\mathbf{W}^{[f]} \leftarrow \left(\mathbf{W}^{[f]} \big(\mathbf{W}^{[f]}\big)^{\mathrm{H}}\right)^{-1/2} \mathbf{W}^{[f]}$$

(13)

Although the original nonlinearity G used in (11) is also derived from the source prior PDF, as $\mathrm{G}(|\mathbf{y}_n|^2) = -\log \hat{p}(\mathbf{y}_n)$ [15, 16], the nonlinearities in FIVA should be regarded as entropy estimators, so different nonlinearities can also be used, even ones without a direct association with a source prior PDF. For example, $\mathrm{G}(\cdot) = \sqrt{\cdot}$ and $\mathrm{G}(\cdot) = \log(\cdot)$ are two frequently used nonlinear functions.

When the IVA updating rules in (9) and (12) are compared with the corresponding updating rules of conventional Infomax ICA [5] and complex-valued FastICA [6], one finds that they have nearly the same expressions; the only difference is the step from univariate to multivariate nonlinearities. This means the multivariate nonlinearity is central to IVA algorithms, and choosing a proper nonlinearity will improve the source separation performance.