Let \(\mathbb {O}=\left \{O_{r},y_{r}\right \}_{r=1}^{R}\) be a set of R labeled sequences of length T, where \(O_{r}=\left \{O_{r}^{(1)},\cdots,O_{r}^{(T)}\right \}\) and \(y_{r} \in \{1,\cdots,C\}\) is the label (class) of sequence \(O_{r}\). First, we need to identify subgroups of observations that share common patterns. Ground truth information cannot be used for this task as it is insufficient and unreliable. For instance, a large, deeply buried mine can have a signature similar to that of a small, shallowly buried mine. Furthermore, the same mine buried at the same depth in soils with different properties may have different signatures. Thus, the partitioning needs to be done in an unsupervised way, i.e., independently of the observations’ labels and the limited ground truth information. In our approach, we use unsupervised learning to cluster the set of all observations, \(\mathbb {O}\), into subgroups of “similar” observations. The first step in this approach is to define a measure of similarity between two observations.
4.1 Similarity between observations in the log-likelihood space
4.1.1 Fitting individual models to sequences
Initially, each sequence in the training data, \(O_{r}\), 1≤r≤R, is used to learn an HMM model \(\lambda_{r}\). Even though using only one sequence of observations to learn an HMM might lead to over-fitting, this technique is only an intermediate step that aims to capture the characteristics of each sequence. The produced HMM model is meant to give a maximal description of each sequence, and therefore, over-fitting is not an issue in this context. In fact, it is desired that the model perfectly fits the observation sequence. In this case, the likelihood of each sequence with respect to its corresponding model is expected to be higher than its likelihoods with respect to the remaining models.
Let \(\left \{\lambda _{r}^{(0)}\right \}_{r=1}^{R}\) be the set of initial models and let \(s_{n}^{(r)}, \ 1 \leq n \leq N\), be the representative of each state in \(\lambda _{r}^{(0)}\). Each model has N states. First, the model states can be assigned to the sequence observations either heuristically, using domain knowledge, or automatically, by clustering the sequence observations into N clusters. In our approach, we use the latter and define the states’ means and observations as the centers and elements of the resulting clusters, respectively. The transition matrix and the initial probabilities of \(\lambda _{r}^{(0)}\) are then set according to these assignments. For the emission probabilities, the initialization differs depending on whether we use a discrete or a continuous HMM.
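As an illustration, the unsupervised state assignment and the resulting initialization of the priors and transition matrix for a single sequence could be sketched as follows (Python; the function name, the use of scikit-learn's KMeans, and the small smoothing constant are our illustrative choices, not part of the original method).

```python
import numpy as np
from sklearn.cluster import KMeans

def init_states_from_sequence(O, n_states):
    """Assign the T observations of one sequence O (T x d array) to N states
    by clustering, then derive initial priors and transition probabilities
    from the resulting state sequence."""
    km = KMeans(n_clusters=n_states, n_init=10).fit(O)
    states = km.labels_                      # state index of each observation
    means = km.cluster_centers_              # state representatives s_n

    # Initial state distribution: favor the state the sequence starts in
    pi = np.full(n_states, 1e-3)             # small floor avoids zero entries
    pi[states[0]] += 1.0
    pi /= pi.sum()

    # Transition matrix: counts of consecutive state pairs, row-normalized
    A = np.full((n_states, n_states), 1e-3)
    for t in range(len(states) - 1):
        A[states[t], states[t + 1]] += 1.0
    A /= A.sum(axis=1, keepdims=True)
    return means, states, pi, A
```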
For the discrete case, the codewords \(\{v_{1},\cdots,v_{M}\}\) of the initial individual DHMM model are the actual observations of the sequence \(\{O_{1},\cdots,O_{T}\}\). The emission probability of each codeword in each state is inversely proportional to its distance to the mean of that state. We use
$$ b_{n}(m) = \frac{\frac{1}{\left\| v_{m}-s_{n}\right\|}}{\sum_{l=1}^{N}\frac{1}{\left\|v_{m} -s_{l} \right\|}}, 1 \leq m \leq M, 1 \leq n \leq N. $$
(3)
To satisfy the requirement that \(\sum _{m=1}^{M}b_{n}(m)=1\), we normalize the values using
$$ b_{n}(m) \longleftarrow \frac{b_{n}(m)}{\sum_{l=1}^{M}b_{n}(l)}, 1 \leq m \leq M, 1 \leq n \leq N. $$
(4)
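A direct transcription of (3) and (4) might look like the following sketch (NumPy; the array names `codewords` and `state_means` are hypothetical, and a small `eps` guards against division by zero when a codeword coincides with a state mean).

```python
import numpy as np

def init_discrete_emissions(codewords, state_means, eps=1e-10):
    """codewords: (M, d) array of symbols v_m; state_means: (N, d) array s_n.
    Returns B with B[n, m] = b_n(m), each row summing to one."""
    # Pairwise distances ||v_m - s_n||, shape (N, M)
    dist = np.linalg.norm(state_means[:, None, :] - codewords[None, :, :], axis=2)
    inv = 1.0 / (dist + eps)

    # Eq. (3): membership of codeword m in state n, normalized over the states
    B = inv / inv.sum(axis=0, keepdims=True)

    # Eq. (4): rescale so that each state's emission probabilities sum to one
    B = B / B.sum(axis=1, keepdims=True)
    return B
```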
In the continuous case, the emission probability density functions are modeled by mixtures of Gaussians. In the case of individual sequence models, as the number of observations is small, we use a single component mixture for each state. Thus, the observations belonging to each state are used to estimate the mean and covariance of that state’s component. We use
$$ \mu_{n} = \text{mean} \left\{O_{t} | O_{t} \in s_{n}\right\}, 1 \leq n \leq N, $$
(5)
and
$$ \Sigma_{n} = \text{covariance} \left\{O_{t} | O_{t} \in s_{n}\right\}, 1 \leq n \leq N. $$
(6)
Then, the Baum-Welch algorithm [21] is used to adapt the model parameters to each observation sequence. Let \(\{\lambda _{r}\}_{r=1}^{R}\) be the set of trained individual models.
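For concreteness, this per-sequence training step could be implemented with the hmmlearn package as sketched below; fitting one GaussianHMM per sequence corresponds to the continuous, single-component case, and the number of iterations is an arbitrary illustrative choice.

```python
from hmmlearn import hmm

def fit_individual_models(sequences, n_states):
    """Fit one continuous HMM per training sequence with Baum-Welch (EM).
    sequences: list of (T, d) arrays. Returns the list of fitted models."""
    models = []
    for O in sequences:
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag",
                            n_iter=50)
        m.fit(O)                 # Baum-Welch on a single sequence
        models.append(m)
    return models
```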
Next, we need to define a measure that evaluates the similarity between pairs of observation sequences. While similarity between static data observations is straightforward and well defined, defining a similarity between observation sequences is more of a challenge. Within the context of HMM modeling, we consider two observation sequences similar if: (i) they fit each other’s models; and (ii) they have similar Viterbi optimal paths [17].
4.1.2 Log-likelihood-based similarity
The log-likelihood, L(i,j), of sequence \(O_{i}\) being generated from model \(\lambda_{j}\) reflects the degree to which \(O_{i}\) fits \(\lambda_{j}\) and is defined as:
$$ \textbf{L}(i,j)=\log{Pr\left(O_{i}|\lambda_{j}\right)}. $$
(7)
In (7), L can be computed using the forward–backward procedure mentioned in Section 2.1. When the log-likelihood value is high, it is likely that model \(\lambda_{j}\) generated sequence \(O_{i}\). In this case, sequences \(O_{i}\) and \(O_{j}\) are expected to have common salient features and are considered to be similar. On the other hand, when the likelihood term is low, it is unlikely that model \(\lambda_{j}\) generated the sequence \(O_{i}\). In this case, \(O_{i}\) and \(O_{j}\) are considered to be dissimilar. For each observation sequence \(O_{r}\), 1≤r≤R, we compute its likelihood in each model \(\lambda_{p}\), \(Pr(O_{r}|\lambda_{p})\), for 1≤p≤R. This will result in an R×R log-likelihood matrix.
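Assuming each trained individual model exposes a scoring routine that returns \(\log Pr(O|\lambda)\) (e.g., `score` in the hmmlearn sketch above), the R×R matrix can be filled in directly:

```python
import numpy as np

def loglikelihood_matrix(sequences, models):
    """L[i, j] = log Pr(O_i | lambda_j), computed with each model's
    forward-procedure-based score() method."""
    R = len(sequences)
    L = np.zeros((R, R))
    for i, O in enumerate(sequences):
        for j, model in enumerate(models):
            L[i, j] = model.score(O)
    return L
```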
4.1.3 Path-mismatch-based penalty
The likelihood-based similarity may not always be accurate. In fact, some observations can have a high likelihood in a visually different model. This occurs when most of the elements of a sequence partially match only one or two of the states of the model. In this case, the observation sequence can have a high likelihood in the model, but its optimal Viterbi path will deviate from the typical path. To alleviate this problem, we introduce a penalty term, P(i,j), to the log-likelihood measure that is related to the mismatch between the most likely sequence of hidden states of the test sequence (\(O_{i}\)) and that of the generating sequence (\(O_{j}\)), i.e.,
$$ \mathbf{P}(i,j)=\mathbf{D_{Edit}}\left(Q^{(ji)}, Q^{(jj)}\right), \quad 1 \leq i,j \leq R. $$
(8)
In (8), P(i,j) is the distance between the Viterbi optimal path, \(Q^{(ji)}\), of testing sequence \(O_{i}\) with model \(\lambda_{j}\), and the Viterbi optimal path, \(Q^{(jj)}\), of testing sequence \(O_{j}\) with model \(\lambda_{j}\). \(\mathbf{D_{Edit}}\) is the “edit distance” [17], commonly used in string comparisons. The “edit distance” between two strings, say p and q, is defined as the minimum number of single-character edit operations (deletions, insertions, and/or replacements) that would convert p into q. The Viterbi path mismatch term is intended to ensure that similar sequences have few mismatches in their corresponding Viterbi optimal paths. Since the Viterbi path is already available when using the forward–backward procedure for the likelihood computation, the penalty term does not require significant additional computation.
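A minimal dynamic-programming implementation of this edit distance between two state paths (encoded as integer sequences) is sketched below; with hmmlearn-style models, the paths \(Q^{(ji)}\) and \(Q^{(jj)}\) would be obtained from the Viterbi decoding routine (e.g., `decode`).

```python
import numpy as np

def edit_distance(p, q):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn state path p into state path q."""
    n, m = len(p), len(q)
    D = np.zeros((n + 1, m + 1), dtype=int)
    D[:, 0] = np.arange(n + 1)
    D[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            D[i, j] = min(D[i - 1, j] + 1,         # deletion
                          D[i, j - 1] + 1,         # insertion
                          D[i - 1, j - 1] + cost)  # substitution
    return D[n, m]
```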
Finally, we define the similarity, S, between two sequences \(O_{i}\) and \(O_{j}\) by combining (7) and (8):
$$ \textbf{S}(i,j)=\alpha\textbf{L}(i,j)-(1-\alpha)\textbf{P}(i,j). $$
(9)
In (9), the mixing factor, α∈[0,1], is a trade-off parameter between the log-likelihood-based similarity and the Viterbi-path-mismatch-based dissimilarity. It is estimated experimentally by maximizing the intra-class similarity and minimizing the inter-class similarity across the training data. A larger value of α corresponds to a dominant log-likelihood-based similarity, where the mismatch penalty contributes little. A smaller α corresponds to a more significant path mismatch penalty.
Using (9) to compute the similarity between all pairs of observations results in a similarity matrix that is not symmetric. Thus, we use the following three-step symmetrization scheme to transform it into a pairwise distance matrix:
$$ \left\{ \begin{array}{ll} 1.\quad \textbf{D}(i,j)=-\textbf{S}(i,j) \quad 1 \leq i,j \leq R \\ 2.\quad \textbf{D}(i,i)=0, \quad 1 \leq i \leq R \\ 3.\quad \textbf{D}(i,j)=\max(\textbf{D}(i,j),\textbf{D}(j,i)),\quad 1 \leq i,j\leq R. \end{array} \right. $$
(10)
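Combining (9) and (10), the similarity and distance matrices can be computed in a few lines (the penalty matrix P is assumed to hold the pairwise edit distances defined above, and `alpha` is the mixing factor):

```python
import numpy as np

def similarity_to_distance(L, P, alpha):
    """Combine log-likelihood L and path-mismatch penalty P into the
    similarity S of Eq. (9), then symmetrize into a distance matrix D (Eq. 10)."""
    S = alpha * L - (1.0 - alpha) * P   # Eq. (9)
    D = -S                              # step 1
    np.fill_diagonal(D, 0.0)            # step 2
    D = np.maximum(D, D.T)              # step 3
    return D
```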
4.2 Pairwise distance-based clustering
The distance matrix, computed using (10), reflects the degree to which pairs of sequences are considered similar. The largest variation is expected to be between sequences from different classes. Other significant variations may exist within the same class, e.g., the groups of signatures shown in Fig. 3. Our goal is to identify the similar groups so that one model can be learned for each group. This task can be achieved using any relational clustering algorithm. In our work, we use the standard agglomerative hierarchical algorithm [18].
Agglomerative hierarchical clustering is a bottom-up approach that starts with each data point as a cluster. It then proceeds by iteratively merging the most similar clusters to produce a hierarchy of clusterings. Several measures have been used to assess the similarity between clusters [18]. Examples include single link, complete link, average link, and Ward distance. The complete link method tends to produce a large number of small and compact clusters, while the single link method is known to result in a few “elongated” clusters with a large number of points. A compromise between the two is the minimum-variance distance, or Ward distance [22]. This distance is defined as
$$ d(i,j)=\frac{n_{i}n_{j}}{n_{i}+n_{j}}\left\|\mathbf{c_{i}}-\mathbf{c_{j}}\right\|^{2} $$
(11)
where \(n_{k}\) and \(\mathbf{c_{k}}\) are the cardinality and the centroid of cluster \(C_{k}\), respectively. It has been shown in [17] that this approach merges the two clusters that lead to the smallest increase in the overall variance.
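As an example, this relational clustering step can be carried out with SciPy's hierarchical clustering applied to the precomputed distance matrix. Note that Ward linkage formally assumes Euclidean distances between raw feature vectors, so this sketch uses average linkage as a stand-in; the number of clusters K is a user choice.

```python
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_sequences(D, K):
    """Agglomerative clustering of sequences from the pairwise distance matrix D.
    Returns a cluster label in {1, ..., K} for each sequence."""
    condensed = squareform(D, checks=False)        # condensed form required by linkage
    Z = linkage(condensed, method="average")       # bottom-up merging
    return fcluster(Z, t=K, criterion="maxclust")  # cut the dendrogram into K clusters
```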
4.3 Ensemble HMM initialization and training
The previous clustering step results in K clusters, each consisting of potentially similar sequences. Each cluster is then used to learn an HMM, resulting in an ensemble of K HMMs. Let \(N_{k}\) denote the number of sequences assigned to cluster k. Since our clustering step did not use class labels, clusters may include sequences from different classes. Let \(N_{k}^{(c)}\) be the number of sequences in cluster k that belong to class c, such that \(\sum _{c=1}^{C}{N_{k}^{(c)}}=N_{k}\). For instance, for the landmine example, if we let c=1 denote the class of mines and c=0 denote the class of clutter, \(N_{k}^{(1)}\) would be the number of mines assigned to cluster k.
The next step of our approach consists of learning a set of HMMs that reflect the diversity of the training data. Since a cluster contains a set of similar sequences, and each cluster may include observations from different classes, we learn one HMM model \(\left \{\lambda _{k}^{(c)}\right \}\) for each set of sequences assigned to class c within cluster k. Let \(\mathbb {O}_{k}^{(c)}=\left \{O_{r}^{(c)},y_{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be the set of sequences partitioned into cluster k that belong to class c and let \(\left \{\lambda _{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be their corresponding individual HMM models, c∈{1,⋯,C}.
For each cluster, we select one of the following training methods based on the cluster’s size and homogeneity.
-
Clusters dominated by sequences from only one class: In this case, we learn only one model for this cluster. The sequences within this cluster are presumably similar and belong to the same ground truth class, denoted \(C_{i}\). We assume that this cluster is representative of that particular dominating class. It is expected that the class conditional posterior probability is unimodal and peaks around the MLE of the parameters. Thus, a maximum likelihood estimation would result in an HMM that best fits this particular class. For these reasons, we use the standard Baum-Welch re-estimation procedure [21]. Let \(K_{1}\) be the number of homogeneous clusters that fall into this category and let \(\left \{\lambda _{i}^{(C_{i})},i=1,\cdots,K_{1}\right \}\) denote the set of BW-trained models.
-
Clusters with a mixture of observations belonging to different classes: In this case, it is expected that the posterior distribution of the classes is multimodal. Thus, we need to learn one model for each class represented in this cluster. The MLE approach is not adequate, and more discriminative learning techniques such as genetic algorithms [23] or simulated annealing optimization [24] are needed to address the multimodality. In our work, we build a model for each class within the cluster. We focus on finding the class boundaries within the posteriors rather than trying to approximate a joint posterior probability. Thus, the models’ parameters are jointly optimized to minimize the overall misclassification error using a discriminative learning approach [25]. Let \(K_{2}\) be the number of mixed clusters that fall into this category and let \(\left \{\lambda _{j}^{(c)},j=1,\cdots,K_{2},c=1, \cdots,C\right \}\) be the set of MCE-trained models.
-
Clusters containing a small number of sequences: The MLE and MCE learning approaches need a large number of data points to give robust estimates of the model parameters. Thus, when a cluster has few samples, the above approaches may not be reliable. Ignoring these clusters is not a good option as they may contain information about sequences with distinctive characteristics. The Bayesian training framework [26], on the other hand, is suitable for learning model parameters from a small number of training sequences. Specifically, we select only the dominating class for this cluster and learn a single model using a variational Bayesian approach [26] to approximate the class conditional posterior distribution. Let \(K_{3}\) be the number of small clusters that fall into this category and let \(\left \{\lambda _{k}^{(C_{k})}, k=1,\cdots,K_{3}\right \}\) denote the set of Bayesian-trained models.
To summarize, for each homogeneous cluster i, we define one model \(\lambda _{i}^{(C_{i})}\), i=1,⋯,\(K_{1}\), for the dominating class \(C_{i}\). For each mixed cluster j, we define C models: \(\lambda _{j}^{(c)}\), c=1,⋯,C, j=1,⋯,\(K_{2}\). For each small cluster k, we define one model \(\lambda _{k}^{(C_{k})}\) for the dominating class \(C_{k}\). The ensemble HMM mixture is thus \(\left \{\lambda _{k}^{(c)}\right \}\), where k∈{1,⋯,K}, and c=\(C_{k}\) if cluster k is dominated by sequences labeled with class \(C_{k}\), or c∈{1,⋯,C} if cluster k is a mixed cluster.
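The routing of clusters to the three training schemes can be summarized by the following sketch; the thresholds `min_size` and `purity` are illustrative tuning parameters that we introduce here, and the returned tags simply indicate which training routine (Baum-Welch, MCE, or variational Bayes) would be invoked.

```python
import numpy as np

def route_cluster(counts, min_size=10, purity=0.9):
    """Decide how to train the model(s) of one cluster.
    counts: array with counts[c] = number of sequences of class c in the cluster."""
    n_k = counts.sum()
    dominant = int(counts.argmax())
    if n_k < min_size:
        # Small cluster: one variational-Bayes model for the dominating class
        return ("VB", [dominant])
    if counts[dominant] / n_k >= purity:
        # Homogeneous cluster: one Baum-Welch (MLE) model for the dominating class
        return ("BW", [dominant])
    # Mixed cluster: one MCE-trained model per class present in the cluster
    return ("MCE", [c for c in range(len(counts)) if counts[c] > 0])
```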
For simplicity, we assume that all models \(\lambda _{k}^{(c)}\) have a fixed number of states N. For each model \(\lambda _{k}^{(c)}\), the initialization step consists of assigning the priors, the initial state transition probabilities, and the state parameters (initial means and initial emission probabilities) using the observations \(O_{r}^{(c)}\) and their respective individual models \(\lambda _{r}^{(c)}\), \(r \in \left \{1, \cdots, N_{k}^{(c)}\right \}\). In particular, the initial values of the priors and the state transition probabilities are obtained by averaging, respectively, the priors and the state transition probabilities of the individual models \(\lambda _{r}^{(c)}, r \in \left \{1,\cdots,N_{k}^{(c)}\right \}\). The initialization of the emission probabilities in each state, \(b_{n}^{(k,c)}\), depends on whether the HMM is discrete or continuous.
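For the priors and transition probabilities, this averaging can be written directly, as in the sketch below (assuming each individual model exposes `startprob_` and `transmat_` attributes, as hmmlearn models do); the emission initialization for the two cases is detailed next.

```python
import numpy as np

def init_from_individual_models(individual_models):
    """Average the priors and transition matrices of the individual models
    of the sequences assigned to one (cluster, class) pair."""
    pi0 = np.mean([m.startprob_ for m in individual_models], axis=0)
    A0 = np.mean([m.transmat_ for m in individual_models], axis=0)
    # Renormalize to absorb numerical round-off
    pi0 /= pi0.sum()
    A0 /= A0.sum(axis=1, keepdims=True)
    return pi0, A0
```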
-
Discrete HMM (DHMM): the state representatives and the codebook of model \(\lambda _{k}^{(c)}\) are obtained by partitioning and quantizing the observations \(\mathbb {O}_{k}^{(c)}\). First, sequences from cluster k that belong to class c, \(O_{r}^{(c)}\), are “unrolled” to form a vector of observations \(\textbf {U}^{(k,c)}\) of length \(N_{k}^{(c)}T\). The state representatives, \(s_{n}^{(k,c)}\), are obtained by clustering \(\textbf {U}^{(k,c)}\) into N clusters and taking the centroid of each cluster as the state representative. Similarly, the codebook \(\textbf {V}^{(k,c)}=\left [v_{1}^{(k,c)},\cdots,v_{M}^{(k,c)}\right ]\) is obtained by clustering \(\textbf {U}^{(k,c)}\) into M clusters. For each symbol \(v_{m}^{(k,c)}\), the membership in each state \(s_{n}^{(k,c)}\) is computed using
$$ b_{n}^{(k,c)}(m) = \frac{\frac{1}{\|v_{m}^{(k,c)} -s_{n}^{(k,c)}\|}}{\sum_{l=1}^{N}\frac{1}{\left\|v_{m}^{(k,c)}-s_{l}^{(k,c)} \right\|}}, 1 \leq m \leq M. $$
(12)
To satisfy the requirement \(\sum _{m=1}^{M}b_{n}^{(k,c)}(m)=1\), we scale the values by:
$$ b_{n}^{(k,c)}(m) \longleftarrow \frac{b_{n}^{(k,c)}(m)}{\sum_{l=1}^{M}b_{n}^{(k,c)}(l)} $$
(13)
-
Continuous HMM (CHMM): we assume that each state has \(N_{g}\) Gaussian components. For each model \(\lambda _{k}^{(c)}\), as in the discrete case, we define a vector of observations, \(\textbf {U}^{(k,c)}\). First, \(\textbf {U}^{(k,c)}\) is partitioned into N clusters and the center of cluster n is taken as state \(s_{n}^{(k,c)}\). Let \(\textbf {U}_{n}^{(k,c)}\) be the observations assigned to cluster n. Next, we partition \(\textbf {U}_{n}^{(k,c)}\) into \(N_{g}\) clusters using the k-means algorithm [27]. The mean of each component, \(\mu _{n}^{(k,c,g)}\), is the center of one of the resulting clusters, and the covariance, \(\Sigma _{n}^{(k,c,g)}\), is estimated using the observations that belong to that same cluster. If we denote by \(\textbf {U}_{n}^{(k,c,g)}\) the observations that belong to component g of state \(s_{n}^{(k,c)}\), the parameters of \(\lambda _{k}^{(c)}\) are computed using
$$ \mu_{n}^{(k,c,g)} = \text{mean}\left\{\textbf{U}_{n}^{(k,c,g)}\right\}, 1 \leq n \leq N, 1 \leq g \leq N_{g}. $$
(14)
$$ \Sigma_{n}^{(k,c,g)} = \text{covariance} \left\{\textbf{U}_{n}^{(k,c,g)}\right\}, 1 \leq n \leq N, 1 \leq g \leq N_{g}. $$
(15)
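The nested clustering behind (14) and (15) can be sketched as two levels of k-means (scikit-learn's KMeans is used here for illustration; degenerate clusters with too few points are not handled in this sketch).

```python
import numpy as np
from sklearn.cluster import KMeans

def init_chmm_emissions(U, n_states, n_components):
    """Initialize the per-state Gaussian mixtures from the unrolled
    observations U (shape (N_k^(c) * T, d)) of one (cluster, class) pair."""
    state_km = KMeans(n_clusters=n_states, n_init=10).fit(U)
    means, covars = [], []
    for n in range(n_states):
        U_n = U[state_km.labels_ == n]                   # observations of state s_n
        comp_km = KMeans(n_clusters=n_components, n_init=10).fit(U_n)
        mu_n, sigma_n = [], []
        for g in range(n_components):
            U_ng = U_n[comp_km.labels_ == g]             # observations of component g
            mu_n.append(U_ng.mean(axis=0))               # Eq. (14)
            sigma_n.append(np.cov(U_ng, rowvar=False))   # Eq. (15)
        means.append(mu_n)
        covars.append(sigma_n)
    return state_km.cluster_centers_, np.array(means), np.array(covars)
```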
For both the discrete and continuous cases, any clustering algorithm, such as K-means [27] or fuzzy c-means [28], could be used to identify the states, the codebook, or the multiple components. After initialization, we use one of the training schemes described earlier to update the parameters of \(\lambda _{k}^{(c)}\) using the respective observations \(\mathbb {O}_{k}^{(c)}\), k∈{1,⋯,K}, c∈{1,⋯,C}. As mentioned earlier, for homogeneous clusters, BW training results in one model \(\lambda^{BW}\) per cluster; for mixed clusters, MCE training results in C models per cluster, \(\lambda ^{MCE}_{c}, c=1 \ldots C\); and for small clusters, variational Bayesian learning results in one model per cluster, \(\lambda^{VB}\). The output of the Baum-Welch- and VB-trained cluster models is \(Pr(O|\lambda_{k})\), while the output of the MCE-trained cluster models is \(\max _{c}{Pr\left (O|\lambda _{k,c}^{MCE}\right)}\).
4.4 Decision level fusion
The partial confidence values of the different models need to be combined into a single confidence value. Let \(\Lambda =\left \{\lambda ^{BW}_{i},\lambda ^{MCE}_{j},\lambda ^{VB}_{k}\right \}\) be the resulting mixture model composed of a total of K models, K=\(K_{1}+K_{2}+K_{3}\).
Let \(\textbf{F}(k,r)= \log Pr(O_{r}|\lambda_{k})\), 1≤r≤R, 1≤k≤K, be the log-likelihood matrix obtained by testing the R training sequences with the K models. Each column \(\textbf{f}_{r}\) of matrix F represents the feature vector of each sequence in the decision space (recall that \(\textbf{f}_{r}\) is a K-dimensional vector while \(O_{r}\) is a sequence of vector observations of length T). In other words, each column represents the confidences assigned by the K models to each sequence r. Therefore, the set of sequences \(\mathbb {O}=\{O_{r},y_{r}\}_{r=1}^{R}\) is mapped to a confidence space \(\{\textbf {f}_{r}, y_{r}\}_{r=1}^{R}\). Finally, a combination function, \(\mathbb {H}\), takes all the \(\textbf{f}_{r}\)’s as input and outputs the final decision. The general framework for fusing the K outputs is highlighted in Algorithm ??.
Several decision-level fusion techniques, such as simple algebraic combination [29], artificial neural networks (ANN) [30], and hierarchical mixtures of experts (HME) [31], can be used. In our work, we use a single-layer perceptron, i.e., an ANN with no hidden layers. The ANN weights are learned from the labeled training data using the backpropagation algorithm [32].
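A minimal sketch of this fusion stage for a two-class problem is given below: a single-layer perceptron with a sigmoid output trained by gradient descent, which is what backpropagation reduces to when there are no hidden layers. The learning rate and iteration count are illustrative, and in practice the log-likelihood features would typically be standardized first.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_fusion_perceptron(F, y, lr=0.01, n_iter=1000):
    """F: (K, R) matrix of model confidences (one column f_r per sequence).
    y: (R,) array of binary labels. Returns weights w and bias b such that
    the fused confidence of a sequence is sigmoid(w @ f_r + b)."""
    X = F.T                              # rows are the confidence vectors f_r
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)           # current fused confidences
        grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Testing a new sequence: score it with the K ensemble models, then fuse.
# f_new = np.array([model_k.score(O_new) for model_k in ensemble])
# confidence = sigmoid(w @ f_new + b)
```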
The architecture of the proposed eHMM is summarized in Fig. 1. It is composed of four main components: similarity matrix computation, relational clustering, adaptive training scheme, and decision level fusion. To test a new sequence, the outputs of the different models are aggregated into a single confidence value using Algorithm ??.