Our generative model is learned using silhouettes from a set of targets of different classes observed from multiple viewpoints. The learning process identifies a mapping from the HD data space to two LD manifolds corresponding to the shape variations represented in terms of view and identity. In the following, we first discuss the identity and view manifolds. Then we present a non-linear tensor decomposition method that integrates the two manifolds into a generative model for multi-view shape modeling, as shown in Figure 2.

### 3.1 Identity manifold

The identity manifold that plays a central role in our work is intended to capture both inter-class and intra-class shape variability among training targets. In particular, the continuous nature of the proposed identity manifold makes it possible to interpolate valid target shapes between known targets in the training data. There are two important questions to be addressed in order to learn an identity manifold with the desired interpolation capability. The first is which space this identity manifold should span. In other words, should it be learned from the HD silhouette space or a LD latent space? We expect traversal along the identity manifold to result in gradual shape transition and valid shape interpolation between known targets. This would ideally require the identity manifold to span a space that is devoid of all other factors that contribute to the shape variation. Therefore, the identity manifold should be learned in a LD latent space with only the identity factor rather than in the HD data space where the view and identity factors are coupled together. The second important question is how to learn a *semantically valid* identity manifold that supports meaningful shape interpolation for an unknown target. In other words, what kind of constraint should be imposed on the identity manifold to ensure that interpolated shapes correspond to feasible real-world targets? We defer further discussion of the first issue to Section 3.3 and focus here on the second one, which involves determining an appropriate topology for the identity manifold.

The topology determines the span of a manifold with respect to its connectivity and dimensionality. In this work, we suggest a *1D closed-loop structure* to represent the identity manifold, and several considerations support this seemingly arbitrary but practical choice. First, learning a higher-dimensional manifold requires a large set of training samples that may not exist for a specific ATR application, where only a relatively small candidate pool of possible targets-of-interest is available. Second, this identity manifold is assumed to be closed rather than open, because all targets in our ATR problem are man-made ground vehicles that share some degree of similarity, making extreme shape disparity unlikely. Third, the 1D closed structure greatly facilitates the inference process for online ATR tasks. As a result, the manifold topology is reduced to a specific *ordering relationship* of training targets along the 1D closed identity manifold. Ideally, we want targets of the same class or those with similar shapes to lie closer together on the identity manifold than dissimilar ones. Thus we introduce a *class-constrained shortest-closed-path* method to find a unique ordering relationship for the training targets. This method requires a view-independent *distance* or *dissimilarity* measure between two targets. For example, we could use the shape dissimilarity between two 3D target models, which can be approximated by the accumulated mean square errors of multi-view silhouettes.

Assume we have a set of training silhouettes from *N* target types, each belonging to one of *Q* classes, imaged under *M* different views. Let {\mathbf{y}}_{m}^{k} denote the vectorized silhouette of target *k* under view *m* (after the distance transform [29]) and let *L*_{k} ∈ [1, *Q*] denote its class label (*Q* is the number of target classes and each class has multiple target types). Also assume that we have identified a LD identity latent space where the *k*'th target is represented by the vector *i*^{k}, *k* ∈ {1,..., *N*} (*N* is the total number of target types). Let the topology of the manifold spanning the space of {*i*^{k} | *k* = 1,..., *N*} be denoted by **T** = [*t*_{1} *t*_{2} ··· *t*_{N+1}], where *t*_{i} ∈ [1, *N*] and *t*_{i} ≠ *t*_{j} for *i* ≠ *j*, with the exception of *t*_{1} = *t*_{N+1} to enforce a closed-loop structure. Then the class-constrained shortest-closed-path can be written as

{\mathbf{T}}^{*}=\underset{\mathbf{T}}{\arg\min}\sum _{i=1}^{N}D\left({\mathit{i}}^{{t}_{i}},{\mathit{i}}^{{t}_{i+1}}\right),

(1)

where *D*(*i*^{u}, *i*^{v}) is defined as

D\left({\mathit{i}}^{u},{\mathit{i}}^{v}\right)=\sum _{m=1}^{M}\parallel {\mathbf{y}}_{m}^{u}-{\mathbf{y}}_{m}^{v}\parallel +\beta \cdot \epsilon \left({L}_{u},{L}_{v}\right),

(2)

\epsilon \left({L}_{u},{L}_{v}\right)=\begin{cases}0 & \text{if}\ {L}_{u}={L}_{v},\\ 1 & \text{otherwise},\end{cases}

(3)

where ||·|| represents the Euclidean distance and *β* is a constant. The first term in (2) is a view-independent shape dissimilarity measure between targets *u* and *v*, as it is accumulated over all training views. The second term is a penalty that encourages targets belonging to the same class to be grouped together. The manifold topology **T**^{*} defined in (1) tends to group targets of similar 3D shapes and/or the same class together, enforcing the best local *semantic smoothness* along the identity manifold, which is essential for valid shape interpolation between target types.
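As an illustrative sketch, the ordering of (1)-(3) can be found by exhaustive search over closed tours when the candidate pool is small (a larger pool would need a TSP heuristic). The array shapes and the value of *β* below are our assumptions:

```python
import itertools
import numpy as np

def pair_distance(sil_u, sil_v, label_u, label_v, beta=10.0):
    """Eq. (2): silhouette distance accumulated over views plus a class penalty.

    sil_u, sil_v: (M, d) arrays of vectorized distance-transformed
    silhouettes over the M training views.
    """
    shape_term = np.linalg.norm(sil_u - sil_v, axis=1).sum()
    class_term = 0.0 if label_u == label_v else 1.0   # Eq. (3)
    return shape_term + beta * class_term

def shortest_closed_path(silhouettes, labels, beta=10.0):
    """Eq. (1): brute-force search over closed orderings of N targets.

    silhouettes: (N, M, d) array; labels: length-N class labels.
    Feasible only for small N.
    """
    N = silhouettes.shape[0]
    D = np.array([[pair_distance(silhouettes[u], silhouettes[v],
                                 labels[u], labels[v], beta)
                   for v in range(N)] for u in range(N)])
    best_cost, best_tour = np.inf, None
    # Fix target 0 as the start to remove rotational duplicates.
    for perm in itertools.permutations(range(1, N)):
        tour = (0,) + perm + (0,)          # t_1 = t_{N+1}: closed loop
        cost = sum(D[tour[i], tour[i + 1]] for i in range(N))
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return best_tour, best_cost
```

With a large *β*, same-class targets are forced to be adjacent along the tour, which is exactly the grouping behavior that **T**^{*} is meant to enforce.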

It is worth mentioning that the identity manifold to be learned according to **T**^{*} will encompass multiple target classes each of which has several sub-classes. For example, we consider six classes of vehicles in this work each of which includes six sub-class types. Although it is easy to understand the feasibility and necessity of shape interpolation within a class to accommodate intra-class variability, the validity of shape interpolation between two different classes may seem less clear. Actually, **T**^{*} not only defines the ordering relationship within each class but also the neighboring relationship between two different classes. For example the six classes considered in this paper are ordered as: *Armored Personnel Carriers* (APCs) → *Tanks* → *Pick-up Trucks* → *Sedan Cars* → *Minivans* → *SUVs* → *APCs*. Although APCs may not look like Tanks or SUVs in general, APCs are indeed located between Tanks and SUVs along the identity manifold according to **T***. It occurs because that (1) finds an APC-Tank pair and an APC-SUV pair that have the *least shape dissimilarity* compared with all other pairs. Thus this ordering still supports sensible inter-class shape interpolation, although it may not be as smooth as intra-class interpolation, as will be shown later in the experiments.

### 3.2 Conceptual view manifold

We need a view manifold to accommodate the view-induced shape variability for different targets. A common approach is to use non-linear DR techniques, such as LLE or Laplacian eigenmaps, to find the LD view manifold for each target type [29]. One main drawback of using identity-dependent view manifolds is that they may lie in different latent spaces and have to be aligned together in the same latent space for general multi-view modeling. Instead, the view manifold here is designed to be a hemisphere that embraces almost all possible viewing angles around a ground vehicle, as shown in Figure 1, and is characterized by two parameters: the azimuth and elevation angles **Θ** = {*θ*, *ϕ*}. This conceptual manifold provides a unified and intuitive representation of the view space and supports efficient dynamic view estimation.
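For intuition, a point **Θ** = (*θ*, *ϕ*) on the conceptual view manifold corresponds to a viewing direction on the unit hemisphere; a minimal sketch of that correspondence follows (the axis convention is an assumption, not the paper's):

```python
import numpy as np

def view_point(theta, phi):
    """Map azimuth theta and elevation phi to a 3D viewing direction on the
    unit hemisphere around the target (axis convention assumed: z is up)."""
    return np.array([np.cos(phi) * np.cos(theta),
                     np.cos(phi) * np.sin(theta),
                     np.sin(phi)])
```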

### 3.3 Non-linear Tensor Decomposition

We extend the non-linear tensor decomposition in [35] to develop the proposed generative model. The key is to find a view-independent space for learning the identity manifold through the commonly shared conceptual view manifold (addressing the first question raised in Section 3.1).

Let {\mathbf{y}}_{m}^{k}\in {\mathbb{R}}^{d} be the *d*-dimensional, vectorized, distance-transformed silhouette observation of target *k* under view *m*, and let **Θ**_{m} = [*θ*_{m}, *ϕ*_{m}], 0 ≤ *θ*_{m} ≤ 2*π*, 0 ≤ *ϕ*_{m} ≤ *π*, denote the point corresponding to view *m* on the LD view manifold. For each target type *k*, we can learn a non-linear mapping between {\mathbf{y}}_{m}^{k} and the point **Θ**_{m} using the generalized radial basis function (GRBF) kernel as

{\mathbf{y}}_{m}^{k}=\sum _{l=1}^{{N}_{c}}{w}_{l}^{k}\,\kappa \left(\parallel {\mathbf{\Theta}}_{m}-{\mathbf{S}}_{l}\parallel \right)+\left[1\;\;{\mathbf{\Theta}}_{m}\right]{\mathbf{b}}^{k},

(4)

where *κ*(·) represents the Gaussian kernel, {**S**_{l} | *l* = 1,..., *N*_{c}} are the *N*_{c} kernel centers that are usually chosen to coincide with the training views on the view manifold, {w}_{l}^{k} are the target-specific weights of each kernel, and **b**^{k} holds the coefficients of the linear polynomial term [1 **Θ**_{m}] included for regularization. This mapping can be written in matrix form as

{\mathbf{y}}_{m}^{k}={\mathbf{B}}^{k}\psi \left({\mathbf{\Theta}}_{m}\right),

(5)

where **B**^{k} is a *d* × (*N*_{c} + 3) target-dependent linear mapping composed of the weight terms {w}_{l}^{k} in (4), and \psi \left({\mathbf{\Theta}}_{m}\right)={\left[\kappa \left(\parallel {\mathbf{\Theta}}_{m}-{\mathbf{S}}_{1}\parallel \right),\;\cdots,\;\kappa \left(\parallel {\mathbf{\Theta}}_{m}-{\mathbf{S}}_{{N}_{c}}\parallel \right),\;1,\;{\mathbf{\Theta}}_{m}\right]}^{T} is a target-independent non-linear kernel mapping. Since *ψ*(**Θ**_{m}) depends only on the view angle, we reason that the identity-related information is contained within the term **B**^{k}. Given *N* training targets, we obtain their corresponding mapping functions **B**^{k} for *k* = 1,..., *N* and stack them together to form a tensor **C** = [**B**^{1} **B**^{2} ··· **B**^{N}] that contains the information regarding identity. We can use the high-order singular value decomposition (HOSVD) [38] to determine the basis vectors of the identity space corresponding to the data tensor **C**. The application of HOSVD to **C** results in the following decomposition:

{\mathbf{B}}^{k}=\mathbf{A}{\times}_{3}{\mathit{i}}^{k},

(6)

where {*i*^{k} ∈ ℝ^{N} | *k* = 1,..., *N*} are the *identity basis vectors*, **A** is the core tensor with dimensionality *d* × (*N*_{c} + 3) × *N* that captures the coupling effect between the identity and view factors, and ×_{j} denotes the mode-*j* tensor product. Using this decomposition, it is possible to reconstruct the training silhouette corresponding to the *k*'th target under each training view according to

{\mathbf{y}}_{m}^{k}=\mathbf{A}{\times}_{3}{\mathit{i}}^{k}{\times}_{2}\psi \left({\mathbf{\Theta}}_{m}\right).

(7)

This equation supports shape interpolation along the view manifold, thanks to the interpolation-friendly nature of RBF kernels and the well-defined structure of the view manifold. However, it cannot be said with certainty that an arbitrary vector *i* ∈ span(*i*^{1},..., *i*^{N}) will result in a valid shape interpolation, due to the sparse coverage of the training set in terms of identity variation.
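As an illustrative sketch of the pipeline in (4)-(7), the code below fits each target's mapping **B**^{k} by least squares over the kernel features *ψ*(**Θ**), stacks the mappings into the tensor **C**, and recovers the identity basis vectors with a mode-3 SVD. The kernel width `sigma`, the plain least-squares fit, and the helper names are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def psi(Theta, centers, sigma=0.5):
    """Kernel feature map of Eq. (5): N_c Gaussian responses to the kernel
    centers followed by the linear polynomial part [1, Theta]; returns a
    vector of length N_c + 3 (Theta holds the azimuth and elevation)."""
    Theta = np.asarray(Theta, dtype=float)
    rbf = np.exp(-np.sum((np.asarray(centers) - Theta) ** 2, axis=1)
                 / (2.0 * sigma ** 2))
    return np.concatenate([rbf, [1.0], Theta])

def fit_mapping(Y, Thetas, centers, sigma=0.5):
    """Least-squares estimate of the d x (N_c + 3) matrix B^k such that
    y_m ~= B^k psi(Theta_m) over the M training views, Y being (M, d)."""
    Psi = np.stack([psi(t, centers, sigma) for t in Thetas])  # (M, N_c + 3)
    B_T, *_ = np.linalg.lstsq(Psi, Y, rcond=None)             # Psi @ B^T ~= Y
    return B_T.T

def hosvd_identity(C):
    """Mode-3 HOSVD of the stacked tensor C of shape (d, N_c + 3, N):
    returns the core tensor A and a matrix whose k'th row is the identity
    basis vector i^k, so that B^k = A x_3 i^k (cf. Eqs. (6)-(7))."""
    d, p, N = C.shape
    C3 = C.reshape(d * p, N).T                     # mode-3 unfolding, (N, d*p)
    U3, _, _ = np.linalg.svd(C3, full_matrices=False)
    A = np.tensordot(C, U3, axes=([2], [0]))       # core tensor A = C x_3 U3^T
    return A, U3

def mode3_product(A, v):
    """Mode-3 product of a (d, p, N) tensor with a length-N vector."""
    return np.tensordot(A, v, axes=([2], [0]))
```

Reconstructing a silhouette as in (7) then amounts to `mode3_product(A, U3[k]) @ psi(Theta, centers)`.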

To support meaningful shape interpolation, we constrain the identity space to be a 1D structure that includes only those points on a closed B-spline curve connecting the identity basis vectors {*i*^{k} | *k* = 1,..., *N*} according to the manifold topology defined in (1). We refer to this 1D structure as the *identity manifold*, denoted by \mathcal{M}\subset {\mathbb{R}}^{N}. Then an arbitrary identity vector \mathit{i}\in \mathcal{M} would be semantically meaningful due to its proximity to the basis vectors, and should support a valid shape interpolation. Although the identity manifold \mathcal{M} has an intrinsic 1D closed-loop structure, it is still defined in the tensor space ℝ^{N}. To facilitate the inference process, we introduce an intermediate representation, i.e., a unit circle as an equivalent of \mathcal{M} parameterized by a single variable. First, we map all identity basis vectors {*i*^{k} | *k* = 1,..., *N*} onto a set of angles uniformly distributed along a unit circle, {*α*_{k} = (*k* − 1) · 2*π*/*N* | *k* = 1,..., *N*}.

Then, as shown in Figure 3, for any *α*′ ∈ [0, 2*π*) that lies between *α*_{j} and *α*_{j+1} along the unit circle, we can obtain its corresponding identity vector \mathit{i}\left({\alpha}^{\prime}\right)\in \mathcal{M} from the two closest basis vectors *i*^{j} and *i*^{j+1} via spline interpolation along \mathcal{M} while maintaining the distance ratio defined below:

\frac{\mid {\alpha}^{\prime}-{\alpha}_{j}\mid}{\mid {\alpha}^{\prime}-{\alpha}_{j+1}\mid}=\frac{\mathcal{D}\left(\mathit{i}\left({\alpha}^{\prime}\right),{\mathit{i}}^{j}\mid \mathcal{M}\right)}{\mathcal{D}\left(\mathit{i}\left({\alpha}^{\prime}\right),{\mathit{i}}^{j+1}\mid \mathcal{M}\right)},

(8)

where \mathcal{D}\left(\cdot \mid \mathcal{M}\right) is a distance function defined along \mathcal{M}. Now (7) can be generalized for shape interpolation as

\mathbf{y}\left(\alpha ,\phantom{\rule{2.77695pt}{0ex}}\mathbf{\Theta}\right)=\mathbf{A}{\times}_{3}\mathit{i}\left(\alpha \right){\times}_{2}\psi \left(\mathbf{\Theta}\right),

(9)

where *α* ∈ [0, 2*π*) is the identity variable and \mathit{i}\left(\alpha \right)\in \mathcal{M} is its corresponding identity vector along the identity manifold in ℝ^{N}. Thus (9) defines a generative model for multi-view shape modeling that is controlled by two continuous variables, *α* and **Θ**, defined along their own manifolds.
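The mapping from the identity angle *α* back to a vector on \mathcal{M} can be sketched as follows; linear interpolation between the two nearest basis vectors stands in for the paper's closed B-spline and the distance ratio of (8):

```python
import numpy as np

def identity_vector(alpha, basis):
    """Map an identity angle alpha in [0, 2*pi) on the unit circle to an
    identity vector. basis is an (N, N) array whose k'th row is i^{k+1},
    assumed already ordered by the topology T* and placed at the angles
    alpha_k = (k - 1) * 2*pi / N. Linear blending of the two nearest
    anchors is a stand-in for the closed B-spline of the paper (sketch)."""
    N = basis.shape[0]
    step = 2 * np.pi / N
    j = int(alpha // step) % N            # index of the nearest lower anchor
    w = (alpha - j * step) / step         # fraction toward the next anchor
    return (1 - w) * basis[j] + w * basis[(j + 1) % N]
```

Feeding `identity_vector(alpha, basis)` and *ψ*(**Θ**) into (9) then yields the interpolated multi-view silhouette for any point on the two manifolds.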