3.1 Observations on cosine dissimilarity
Assume that we are given the representations of two images I1 and I2, written in lexicographic order as the N-dimensional vectors xi (i = 1, 2). First, each xi ∈ ℝN is normalized so that xi(c) ∈ [0, 1], where c denotes the element index, i.e., the spatial location in the vector. Then, xi is mapped onto the sphere in ℝ2N by
$$ \mathrm{Z}\left({\mathbf{x}}_i\right)=\frac{1}{\sqrt{N}}{\left[\cos {\left({\mathbf{x}}_i\right)}^T\ \sin {\left({\mathbf{x}}_i\right)}^T\right]}^T $$
(3)
where
$$ \cos \left({\mathbf{x}}_i\right)={\left[\cos \left({\mathbf{x}}_i(1)\right),\cos \left({\mathbf{x}}_i(2)\right),\dots, \cos \left({\mathbf{x}}_i(N)\right)\right]}^T, $$
$$ \sin \left({\mathbf{x}}_i\right)={\left[\sin \left({\mathbf{x}}_i(1)\right),\sin \left({\mathbf{x}}_i(2)\right),\dots, \sin \left({\mathbf{x}}_i(N)\right)\right]}^T, $$
and ‖Z(xi)‖ = 1.
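To make the mapping concrete, the following is a minimal NumPy sketch of Eq. (3); the helper name sphere_map and the toy vector are illustrative and not part of the original formulation.

```python
import numpy as np

def sphere_map(x):
    """Map x in R^N (entries normalized to [0, 1]) onto the unit sphere in R^(2N), Eq. (3)."""
    return np.concatenate([np.cos(x), np.sin(x)]) / np.sqrt(x.size)

x = np.random.rand(8)                  # toy normalized representation
print(np.linalg.norm(sphere_map(x)))   # 1.0, since cos^2 + sin^2 = 1 for every element
```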
We have
$$ \mathrm{Z}{\left({\mathbf{x}}_1\right)}^T\mathrm{Z}\left({\mathbf{x}}_2\right)=\frac{1}{N}\sum \limits_{c=1}^N\cos \left({\mathbf{x}}_1(c)-{\mathbf{x}}_2(c)\right). $$
(4)
Recall that the cosine distance measure between two vectors x and y is given by
$$ d\left(\mathbf{x},\mathbf{y}\right)=1-\frac{{\mathbf{x}}^T\mathbf{y}}{\left\Vert \mathbf{x}\right\Vert \left\Vert \mathbf{y}\right\Vert }. $$
(5)
If the distance between Z(x1) and Z(x2) takes the form
$$ \mathrm{d}\left(\mathrm{Z}\left({\mathbf{x}}_1\right),\mathrm{Z}\left({\mathbf{x}}_2\right)\right)=\frac{1}{2}{\left\Vert \mathrm{Z}\left({\mathbf{x}}_1\right)-\mathrm{Z}\left({\mathbf{x}}_2\right)\right\Vert}_F^2, $$
(6)
then it is equal to a cosine-based distance measure and
$$ \mathrm{d}\left(\mathrm{Z}\left({\mathbf{x}}_1\right),\mathrm{Z}\left({\mathbf{x}}_2\right)\right)=1-\frac{1}{N}\sum \limits_{c=1}^N\cos \left({\mathbf{x}}_1(c)-{\mathbf{x}}_2(c)\right). $$
(7)
It can be seen that if I1 ≈ I2, i.e., x1(c) − x2(c) ≈ 0 for all c, then d(Z(x1), Z(x2)) → 0. Conversely, if the two images are unrelated, their local elements are mismatched and the distance is large.
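As a quick numerical check of Eqs. (4)-(7), the halved squared Euclidean distance between the mapped vectors coincides with the cosine-based dissimilarity; the sketch below assumes NumPy and the hypothetical sphere_map helper from the previous sketch.

```python
import numpy as np

def sphere_map(x):
    return np.concatenate([np.cos(x), np.sin(x)]) / np.sqrt(x.size)   # Eq. (3)

x1, x2 = np.random.rand(8), np.random.rand(8)
lhs = 0.5 * np.linalg.norm(sphere_map(x1) - sphere_map(x2)) ** 2      # Eq. (6)
rhs = 1.0 - np.mean(np.cos(x1 - x2))                                  # Eq. (7)
print(np.isclose(lhs, rhs))                                           # True
```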
Moreover, the mapping function (3) from ℝN to ℝ2N is equivalent to a mapping function f : ℝN → ℂN defined by:
$$ f\left({\mathbf{x}}_t\right)={\mathbf{z}}_t=\frac{1}{\sqrt{2}}{e}^{i\alpha \pi {\mathbf{x}}_t}=\frac{1}{\sqrt{2}}\left[\begin{array}{c}{e}^{i\alpha \pi {\mathbf{x}}_t(1)}\\ {}\vdots \\ {}{e}^{i\alpha \pi {\mathbf{x}}_t(N)}\end{array}\right] $$
(8)
where, by Euler’s formula [31],
$$ {\mathrm{e}}^{i\;\alpha \pi {\mathbf{x}}_t}=\cos \left(\alpha \pi {\mathbf{x}}_t\right)+i\sin \left(\alpha \pi {\mathbf{x}}_t\right) $$
(9)
Therefore, the cosine dissimilarity of a data pair in the input real space equals the Frobenius distance of the corresponding data pair in the complex domain. The cosine dissimilarity in the real domain is known to be robust and to suppress outliers [32]. With the idea of utilizing different robust similarity metrics to extend NMF, we introduce a new dimensionality reduction method (proCMF), which is related to conventional proNMF and uses the cosine dissimilarity metric to measure the reconstruction error. The difficulty of optimizing the real objective with the cosine distance is addressed by converting it into a complex optimization problem with the Frobenius norm, as described in the following sections.
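The correspondence between the two domains can also be illustrated numerically. The sketch below, assuming NumPy and the arbitrary illustrative choice α = 1, shows that the squared Frobenius distance under the mapping (8) reproduces the cosine dissimilarity of the real inputs up to the constant factor N.

```python
import numpy as np

alpha = 1.0                                              # illustrative choice

def complex_map(x):
    return np.exp(1j * alpha * np.pi * x) / np.sqrt(2)   # Eq. (8)

x1, x2 = np.random.rand(8), np.random.rand(8)
frob_sq = np.linalg.norm(complex_map(x1) - complex_map(x2)) ** 2
cos_dissim = 1.0 - np.mean(np.cos(alpha * np.pi * (x1 - x2)))
print(np.isclose(frob_sq, x1.size * cos_dissim))         # True
```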
3.2 Problem formulation
In this section, we formulate the problem of multi-variant data factorization within the framework of complex data decomposition. Given the sample dataset X = [x1, x2, …, xM], xi ∈ ℝN, we convert the real data matrix X ∈ ℝN × M to a complex matrix Z ∈ ℂN × M by the mapping (8) and perform matrix factorization in this complex feature space.
The basic idea of proCMF is that the coefficient of each data point zi (i = 1, 2, …, M) lies within the subspace spanned by the column vectors of a projection matrix; hence, the coefficient matrix H ∈ ℂK × M is obtained by a linear transformation of the samples. More specifically, given a matrix Z ∈ ℂN × M, we need to find two matrices W ∈ ℂN × K and H ∈ ℂK × M that minimize \( {\left\Vert \mathbf{Z}-\mathbf{WH}\right\Vert}_F^2\ \mathrm{s}.\mathrm{t}.\ \mathbf{H}=\mathbf{VZ} \), where V ∈ ℂK × N is the projection matrix. The proCMF objective function is as follows:
$$ \underset{\mathbf{W},\mathbf{V}}{\min }{\mathrm{O}}_{proCMF}\left(\mathbf{W},\mathbf{V}\right)=\underset{\mathbf{W},\mathbf{V}}{\min}\frac{1}{2}{\left\Vert \mathbf{Z}-\mathbf{WVZ}\right\Vert}_F^2 $$
(10)
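For concreteness, the objective (10) can be evaluated for arbitrary complex factors as in the rough NumPy sketch below; the dimensions, random data, and α = 1 are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 16, 40, 5
X = rng.random((N, M))                    # real data with entries in [0, 1]
Z = np.exp(1j * np.pi * X) / np.sqrt(2)   # complex features via Eq. (8), alpha = 1

W = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
V = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))

objective = 0.5 * np.linalg.norm(Z - W @ (V @ Z), 'fro') ** 2   # Eq. (10), with H = VZ
print(objective)
```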
3.3 Complex-valued gradient descent method
It can be seen that (10) is a nonconvex minimization problem with respect to the variables W and V jointly; therefore, it is impractical to obtain the globally optimal solution. This NP-hard problem can be tackled by applying block coordinate descent (BCD) with two matrix blocks [33] to obtain a local solution via the following scheme:
Given an initial W(0), we find the optimal solution V(t+1) such that:
$$ {\mathbf{V}}^{\left(t+1\right)}=\arg \underset{\mathbf{V}}{\min }{\mathrm{O}}_{\mathrm{proCMF}}\left({\mathbf{W}}^{(t)},\mathbf{V}\right)=\arg \underset{\mathbf{V}}{\min}\frac{1}{2}{\left\Vert \mathbf{Z}-{\mathbf{W}}^{(t)}\mathbf{VZ}\right\Vert}_F^2. $$
(11)
Because there is no nonnegativity constraint, the basis can be updated directly via the Moore–Penrose pseudoinverse [34]
$$ {\mathbf{W}}^{\left(t+1\right)}=\mathbf{Z}{\left({\mathbf{V}}^{\left(t+1\right)}\mathbf{Z}\right)}^{\dagger }. $$
(12)
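A sketch of this two-block BCD scheme is given below, assuming NumPy; solve_V stands for the inner complex gradient descent solver of (11) that is described next, and update_W implements the closed-form step (12).

```python
import numpy as np

def update_W(Z, V):
    """Eq. (12): W = Z (V Z)^dagger, available in closed form since W is unconstrained."""
    return Z @ np.linalg.pinv(V @ Z)

def bcd(Z, K, solve_V, n_outer=20, seed=0):
    """Alternate the V-step (11) and the W-step (12) for a fixed number of outer iterations."""
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    W = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
    V = None
    for _ in range(n_outer):
        V = solve_V(Z, W)      # inner CGD solver for Eq. (11)
        W = update_W(Z, V)     # closed-form update, Eq. (12)
    return W, V
```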
To find the optimal solution of (11), we use the complex-valued gradient descent (CGD) algorithm. Here, (11) is regarded as a real-valued function that needs to be minimized with respect to the complex variable V. In general, minimizing such a real scalar function of a complex variable amounts to solving the following unconstrained optimization problem:
$$ \underset{\mathbf{V}}{\min }f\left(\mathbf{V}\right) $$
(13)
where
$$ {\displaystyle \begin{array}{l}f\left(\mathbf{V}\right)=\frac{1}{2}{\left\Vert \mathbf{Z}-\mathbf{WVZ}\right\Vert}_F^2=\frac{1}{2}\mathrm{Trace}{\left(\mathbf{Z}-\mathbf{WVZ}\right)}^H\left(\mathbf{Z}-\mathbf{WVZ}\right)\\ {}=\frac{1}{2}\mathrm{Trace}\left({\mathbf{Z}}^H\mathbf{Z}-{\mathbf{Z}}^H{\mathbf{V}}^H{\mathbf{W}}^H\mathbf{Z}-{\mathbf{Z}}^H\mathbf{WVZ}+{\mathbf{Z}}^H{\mathbf{V}}^H{\mathbf{W}}^H\mathbf{WVZ}\right),\end{array}} $$
(14)
and (·)H denotes the matrix Hermitian (conjugate transpose) operation.
Let V = Re(V) + i Im(V), where i is the imaginary unit (i² = − 1). Then, f(V) can be viewed as a real bivariate function of its real and imaginary components.
In most complex-variable optimization problems, the objective functions are real-valued functions of complex arguments. They are not analytic on the complex plane and do not satisfy the Cauchy-Riemann conditions. Brandwood’s analytic theory [35] can be applied to overcome this common problem. Recall that if f(V) satisfies Brandwood’s analytic condition, i.e., f(V) is analytic with respect to the complex-valued variable V and its complex conjugate \( \overline{\mathbf{V}} \), where V and \( \overline{\mathbf{V}} \) are treated as independent variables, then the first-order Taylor expansion of \( f\left(\mathbf{V},\overline{\mathbf{V}}\right) \) is as follows:
$$ \Delta f=\left\langle {\nabla}_{\overline{\mathbf{V}}}f,\Delta \mathbf{V}\right\rangle +\left\langle {\nabla}_{\mathbf{V}}f,\Delta \overline{\mathbf{V}}\right\rangle =2\operatorname{Re}\left\{\left\langle {\nabla}_{\overline{\mathbf{V}}}f,\Delta \mathbf{V}\right\rangle \right\} $$
(15)
and the complex gradient of f(V) is defined as
$$ {\nabla}_{\overline{\mathbf{V}}}f\left(\mathbf{V},\overline{\mathbf{V}}\right)=\frac{\partial f\left(\mathbf{V}\right)}{\partial \operatorname{Re}\left(\mathbf{V}\right)}+i\frac{\partial f\left(\mathbf{V}\right)}{\partial \operatorname{Im}\left(\mathbf{V}\right)}. $$
(16)
Therefore, the function f(V) is treated as \( f\left(\mathbf{V},\overline{\mathbf{V}}\right) \), where
$$ f\left(\mathbf{V},\overline{\mathbf{V}}\right)=\frac{1}{2}\mathrm{Trace}\left[{\mathbf{Z}}^H\mathbf{Z}-{\mathbf{Z}}^H{\left(\overline{\mathbf{V}}\right)}^T{\mathbf{W}}^H\mathbf{Z}-{\mathbf{Z}}^H\mathbf{WVZ}+{\mathbf{Z}}^H{\left(\overline{\mathbf{V}}\right)}^T{\mathbf{W}}^H\mathbf{WVZ}\right] $$
(17)
The gradient of \( f\left(\mathbf{V},\overline{\mathbf{V}}\right) \) with respect to \( \overline{\mathbf{V}} \) is given by:
$$ {\nabla}_{\overline{\mathbf{V}}}f\left(\mathbf{V},\overline{\mathbf{V}}\right)=-{\mathbf{W}}^H{\mathbf{ZZ}}^H+{\mathbf{W}}^H{\mathbf{W}\mathbf{VZZ}}^H. $$
(18)
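In code, the conjugate gradient expression (18) is a direct transcription; the sketch below assumes NumPy and complex arrays Z, W, V of compatible shapes, with f evaluating the objective (14).

```python
import numpy as np

def f(Z, W, V):
    """Objective (14): 0.5 * ||Z - W V Z||_F^2."""
    return 0.5 * np.linalg.norm(Z - W @ V @ Z, 'fro') ** 2

def grad_V(Z, W, V):
    """Gradient of f(V, conj(V)) with respect to conj(V), Eq. (18)."""
    ZZH = Z @ Z.conj().T
    return -W.conj().T @ ZZH + W.conj().T @ W @ V @ ZZH
```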
The gradient descent method for the unconstrained optimization problem in (13) builds a sequence {V(t)}t ∈ ℕ according to the following iterative form:
$$ {\mathbf{V}}^{\left(t+1\right)}={\mathbf{V}}^{(t)}-{\beta}^{(t)}{\nabla}_{\overline{\mathbf{V}}}f\left({\mathbf{V}}^{(t)},{\overline{\mathbf{V}}}^{(t)}\right) $$
(19)
where β(t) is the step size, a small positive constant that minimizes \( f\left({\mathbf{V}}^{(t)}-{\beta}^{(t)}{\nabla}_{\overline{\mathbf{V}}}f\left({\mathbf{V}}^{(t)},{\overline{\mathbf{V}}}^{(t)}\right)\right) \) over ℝ. In this paper, backtracking line search, also known as the Armijo rule [36], is used to estimate the step size. In this rule, \( {\beta}^{(t)}={\mu}^{k_t} \), where 0 < μ < 1 and kt is the first non-negative integer k such that:
$$ f\left({\mathbf{V}}^{\left(t+1\right)},{\overline{\mathbf{V}}}^{\left(t+1\right)}\right)-f\left({\mathbf{V}}^{(t)},{\overline{\mathbf{V}}}^{(t)}\right)\le 2\sigma \operatorname{Re}\left\{\left\langle {\nabla}_{\overline{\mathbf{V}}}f\left({\mathbf{V}}^{(t)},{\overline{\mathbf{V}}}^{(t)}\right),{\mathbf{V}}^{\left(t+1\right)}-{\mathbf{V}}^{(t)}\right\rangle \right\}. $$
(20)
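A sketch of the Armijo backtracking rule (20) is shown below, assuming NumPy and the f and grad_V helpers sketched above; mu, sigma, and the trial cap are illustrative constants.

```python
import numpy as np

def armijo_step(Z, W, V, f, grad_V, mu=0.5, sigma=0.01, max_k=30):
    """Return V^(t+1) and the accepted step size beta = mu^k satisfying Eq. (20)."""
    G = grad_V(Z, W, V)
    f_old = f(Z, W, V)
    beta = 1.0                                   # try beta = mu^0, mu^1, mu^2, ...
    for _ in range(max_k):
        V_new = V - beta * G                     # gradient step, Eq. (19)
        decrease = f(Z, W, V_new) - f_old
        if decrease <= 2.0 * sigma * np.real(np.vdot(G, V_new - V)):   # condition (20)
            return V_new, beta
        beta *= mu
    return V_new, beta
```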
There always exists a step length β(t) among 1, μ, μ², …, and a stationary point of (13) exists among the limit points of {V(t)}t ∈ ℕ [37]. The iteration is stopped when the solution is close to a stationary point. In practice, a common condition to check whether a point V(t) is close to a stationary point is:
$$ {\left\Vert {\nabla}_{\overline{\mathbf{V}}}f\left({\mathbf{V}}^{(t)},{\overline{\mathbf{V}}}^{(t)}\right)\right\Vert}_F\le \varepsilon $$
(21)
where ε is a pre-defined threshold.
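Putting (19)-(21) together, the inner CGD solver for (11) can be sketched as follows; it assumes NumPy plus the grad_V, f, and armijo_step helpers from the earlier sketches, with eps and the iteration cap chosen for illustration.

```python
import numpy as np

def solve_V_cgd(Z, W, V0, eps=1e-4, max_iter=200):
    """Complex-valued gradient descent for Eq. (11) with the stopping rule (21)."""
    V = V0
    for _ in range(max_iter):
        if np.linalg.norm(grad_V(Z, W, V), 'fro') <= eps:   # stopping criterion, Eq. (21)
            break
        V, _ = armijo_step(Z, W, V, f, grad_V)              # Eqs. (19)-(20)
    return V
```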
The following Algorithm 2 summarizes the optimization process of the proposed proCMF model.