Let \(\mathbb {O}=\left \{O_{r},y_{r}\right \}_{r=1}^{R}\) be a set of R labeled sequences of length T where \(O_{r}=\left \{O_{r}^{(1)},\cdots,O_{r}^{(T)}\right \}\) and \(y_{r} \in \{1,\cdots,C\}\) is the label (class) of sequence \(O_{r}\). First, we need to identify subgroups of observations that have common patterns. Ground truth information could not be used for this task as it is insufficient and unreliable. For instance, a large deep buried mine can have a signature similar to a small shallow buried mine. Furthermore, the same mine buried at the same depth in soil with different properties may have different signatures. Thus, the partitioning needs to be done in an unsupervised way, i.e., regardless of the observation’s labels and the limited ground truth information. In our approach, we use unsupervised learning to cluster the set of all observations, \(\mathbb {O}\), into subgroups of “similar” observations. The first step in this approach is to define a measure of similarity between two observations.
4.1 Similarity between observations in the log-likelihood space
4.1.1 Fitting individual models to sequences
Initially, each sequence in the training data, \(O_{r}\), \(1 \leq r \leq R\), is used to learn an HMM model \(\lambda_{r}\). Even though using only one sequence of observations to learn an HMM might lead to overfitting, this technique is only an intermediate step that aims to capture the characteristics of each sequence. The produced HMM model is meant to give a maximal description of each sequence, and therefore, overfitting is not an issue in this context. In fact, it is desired that the model perfectly fits the observation sequence. In this case, the likelihood of each sequence with respect to its corresponding model is expected to be higher than those with respect to the remaining models.
Let \(\left \{\lambda _{r}^{(0)}\right \}_{r=1}^{R}\) be the set of initial models and let \(s_{n}^{(r)}, \ 1 \leq n \leq N\), be the representative of each state in \(\lambda _{r}^{(0)}\). Each model has N states. First, the model states can be assigned to the sequence observations either heuristically, using domain knowledge, or automatically by clustering the sequence observations into N clusters. In our approach, we use the latter and we define the states’ means and observations as the center and elements of each resulting cluster, respectively. Consequently, the transition matrix and the initial probabilities of \(\lambda _{r}^{(0)}\) are set according to the aforementioned associations. For the emission probabilities, the initialization depends on whether we use the discrete or the continuous HMM.
For the discrete case, the codewords \(\{v_{1},\cdots,v_{M}\}\) of the initial individual DHMM model are the actual observations of the sequence \(\{O_{1},\cdots,O_{T}\}\). The emission probability of each codeword in each state is inversely proportional to its distance to the mean of that state. We use
$$ b_{n}(m) = \frac{\frac{1}{\left\| v_{m}-s_{n}\right\|}}{\sum_{l=1}^{N}\frac{1}{\left\|v_{m}-s_{l} \right\|}}, \quad 1 \leq m \leq M, \ 1 \leq n \leq N. \tag{3} $$
To satisfy the requirement that \(\sum _{m=1}^{M}b_{n}(m)=1\), we normalize the values using
$$ b_{n}(m) \longleftarrow \frac{b_{n}(m)}{\sum_{l=1}^{M}b_{n}(l)}, \quad 1 \leq m \leq M, \ 1 \leq n \leq N. \tag{4} $$
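As an illustration of the discrete-case initialization in (3) and (4), the following minimal Python sketch computes the inverse-distance memberships and then renormalizes each state's row. The function name `init_discrete_emissions`, the choice of the Euclidean norm, and the small constant `eps` (which guards against a codeword coinciding with a state mean) are assumptions made for this example and are not specified in the text.

```python
import math


def init_discrete_emissions(codewords, state_means, eps=1e-12):
    """Return b[n][m], the initial emission probability of codeword m in state n.

    codewords   : list of M feature vectors (tuples of floats)
    state_means : list of N state representatives (tuples of floats)
    eps         : hypothetical guard against a zero distance
    """
    n_states, n_codes = len(state_means), len(codewords)

    # Inverse Euclidean distance between every state mean and every codeword.
    inv = [[1.0 / (math.dist(v, s) + eps) for v in codewords] for s in state_means]

    # Eq. (3): for each codeword, normalize the inverse distances over the states.
    b = [[0.0] * n_codes for _ in range(n_states)]
    for m in range(n_codes):
        total = sum(inv[n][m] for n in range(n_states))
        for n in range(n_states):
            b[n][m] = inv[n][m] / total

    # Eq. (4): renormalize each state's row so that it sums to one over codewords.
    for n in range(n_states):
        row_sum = sum(b[n])
        b[n] = [x / row_sum for x in b[n]]
    return b


if __name__ == "__main__":
    codes = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
    means = [(0.5, 0.0), (3.5, 0.0)]
    for row in init_discrete_emissions(codes, means):
        print([round(x, 3) for x in row])
```

Codewords close to a state mean receive most of that state's emission mass, which is the behavior the two normalization steps are meant to produce.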
In the continuous case, the emission probability density functions are modeled by mixtures of Gaussians. In the case of individual sequence models, as the number of observations is small, we use a single component mixture for each state. Thus, the observations belonging to each state are used to estimate the mean and covariance of that state’s component. We use
$$ \mu_{n} = \text{mean} \left\{O_{t} \mid O_{t} \in s_{n}\right\}, \quad 1 \leq n \leq N, \tag{5} $$
and
$$ \Sigma_{n} = \text{covariance} \left\{O_{t} \mid O_{t} \in s_{n}\right\}, \quad 1 \leq n \leq N. \tag{6} $$
Then, the Baum-Welch algorithm [21] is used to adapt the model parameters to each given observation. Let \(\{\lambda _{r}\}_{r=1}^{R}\) be the set of trained individual models.
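A minimal sketch of how the initial probabilities and the transition matrix of an individual model might be set from the state assignments, before the Baum-Welch refinement. It assumes that each observation of the sequence already carries the index of the cluster (state) it was assigned to, and that transition probabilities are taken as smoothed relative frequencies of consecutive state pairs. The text does not spell out this rule, so the counting scheme, the function name `init_transitions`, and the `smooth` constant are illustrative assumptions.

```python
def init_transitions(labels, n_states, smooth=1e-3):
    """Estimate initial probabilities and a transition matrix for one sequence.

    labels   : list of state indices, one per observation O^(1), ..., O^(T)
    n_states : number of states N
    smooth   : hypothetical additive constant so that no probability is zero
    """
    # Initial probabilities: all mass (up to smoothing) on the first state visited.
    pi = [smooth] * n_states
    pi[labels[0]] += 1.0
    pi_total = sum(pi)
    pi = [p / pi_total for p in pi]

    # Count transitions between consecutive state labels.
    counts = [[smooth] * n_states for _ in range(n_states)]
    for prev, cur in zip(labels, labels[1:]):
        counts[prev][cur] += 1.0

    # Normalize each row; a state that is never left becomes uniform via smoothing.
    A = [[c / sum(row) for c in row] for row in counts]
    return pi, A


if __name__ == "__main__":
    pi, A = init_transitions([0, 0, 1, 1, 2], n_states=3)
    print(pi)
    for row in A:
        print(row)
```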
Next, we need to define a measure that evaluates the similarity between pairs of observation sequences. While similarity between static data observations is straightforward and well defined, defining a similarity between observation sequences is more of a challenge. Within the context of HMM modeling, we consider two observation sequences similar if: (i) they fit each other’s models; and (ii) they have similar Viterbi optimal paths [17].
4.1.2 Log-likelihood-based similarity
The log-likelihood, L(i,j), of sequence \(O_{i}\) being generated from model \(\lambda_{j}\) reflects the degree to which \(O_{i}\) fits \(\lambda_{j}\) and is defined as:
$$ \textbf{L}(i,j)=\log{\Pr\left(O_{i} \mid \lambda_{j}\right)}. \tag{7} $$
In (7), L can be computed using the forward–backward procedure mentioned in Section 2.1. When the log-likelihood value is high, it is likely that model \(\lambda_{j}\) generated sequence \(O_{i}\). In this case, sequences \(O_{i}\) and \(O_{j}\) are expected to have common salient features and are considered to be similar. On the other hand, when the likelihood term is low, it is unlikely that model \(\lambda_{j}\) generated the sequence \(O_{i}\). In this case, \(O_{i}\) and \(O_{j}\) are considered to be dissimilar. For each observation sequence \(O_{r}\), \(1 \leq r \leq R\), we compute its likelihood in each model \(\lambda_{p}\), \(\Pr(O_{r} \mid \lambda_{p})\), for \(1 \leq p \leq R\). This will result in an R×R log-likelihood matrix.
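The construction of the R×R matrix can be sketched in Python as follows. The sketch assumes discrete HMMs stored as hypothetical `(pi, A, B)` tuples (initial probabilities, transition matrix, emission matrix) and sequences given as lists of codeword indices; these data layouts and the function names are assumptions for illustration. Only the forward pass is needed to obtain the likelihood, and per-step scaling is used to avoid numerical underflow on long sequences.

```python
import math


def log_likelihood(obs, pi, A, B):
    """Return log Pr(obs | lambda) for a discrete HMM using the scaled forward pass.

    obs : list of symbol indices
    pi  : pi[n], initial state probabilities
    A   : A[i][j], transition probability from state i to state j
    B   : B[n][m], probability of emitting symbol m in state n
    """
    n_states = len(pi)
    log_lik = 0.0
    alpha = None
    for t, symbol in enumerate(obs):
        if t == 0:
            alpha = [pi[n] * B[n][symbol] for n in range(n_states)]
        else:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_lik += math.log(scale)  # the log-likelihood is the sum of log scales
        alpha = [a / scale for a in alpha]
    return log_lik


def log_likelihood_matrix(sequences, models):
    """Return L with L[i][j] = log Pr(O_i | lambda_j)."""
    return [[log_likelihood(o, *m) for m in models] for o in sequences]


if __name__ == "__main__":
    model = ([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.1, 0.9]])
    print(log_likelihood_matrix([[0, 1], [1, 1]], [model, model]))
```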
4.1.3 Path-mismatch-based penalty
The likelihood-based similarity may not always be accurate. In fact, some observations can have high likelihood in a visually different model. This occurs when most of the elements of a sequence partially match only one or two of the states of the model. In this case, the observation sequence can have a high likelihood in the model but its optimal Viterbi path will deviate from the typical path. To alleviate this problem, we introduce a penalty term, P(i,j), to the log-likelihood measure that is related to the mismatch between the most likely sequence of hidden states of the test sequence (\(O_{i}\)) and that of the generating sequence (\(O_{j}\)), i.e.,
$$ \mathbf{P}(i,j)=\mathbf{D_{Edit}}\left(Q^{(ji)}, Q^{(jj)}\right), \quad 1 \leq i,j \leq R. \tag{8} $$
In (8), P(i,j) is the distance between the Viterbi optimal path, \(Q^{(ji)}\), of testing sequence \(O_{i}\) with model \(\lambda_{j}\), and the Viterbi optimal path of testing sequence \(O_{j}\) with model \(\lambda_{j}\), \(Q^{(jj)}\). In (8), \(\mathbf{D_{Edit}}\) is the “edit distance” [17], commonly used in string comparisons. The “edit distance” between two strings, say p and q, is defined as the minimum number of single-character edit operations (deletions, insertions, and/or replacements) that would convert p into q. The Viterbi path mismatch term is intended to ensure that similar sequences have few mismatches in their corresponding Viterbi optimal paths. Since the Viterbi path is already available when using the forward–backward procedure for the likelihood computation, the penalty term does not require significant additional computation.
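The edit distance between two Viterbi paths can be computed with the standard dynamic-programming recurrence. The sketch below assumes the paths are already available as lists of state indices (the Viterbi decoding itself is not shown), and the function name `edit_distance` is chosen for this example.

```python
def edit_distance(p, q):
    """Minimum number of deletions, insertions, and replacements turning p into q.

    p, q : sequences of comparable items, e.g. lists of state indices or strings.
    """
    # prev[j] holds the distance between the processed prefix of p and q[:j].
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        cur = [i]  # distance between p[:i] and the empty prefix of q
        for j, b in enumerate(q, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # delete a from p
                    cur[j - 1] + 1,  # insert b into p
                    prev[j - 1] + (a != b),  # replace a by b (free if equal)
                )
            )
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    path_ji = [1, 1, 2, 2, 3]  # hypothetical Viterbi path of O_i in model lambda_j
    path_jj = [1, 2, 2, 3, 3]  # hypothetical Viterbi path of O_j in model lambda_j
    print(edit_distance(path_ji, path_jj))
```

Only two rows of the dynamic-programming table are kept, so the memory cost is linear in the path length, while the running time is quadratic in T for paths of length T.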
Finally, we define the similarity, S, between two sequences \(O_{i}\) and \(O_{j}\) by combining (7) and (8):
$$ \textbf{S}(i,j)=\alpha\textbf{L}(i,j)-(1-\alpha)\textbf{P}(i,j). \tag{9} $$
In (9), the mixing factor, α∈[0,1], is a trade-off parameter between the log-likelihood-based similarity and the Viterbi-path-mismatch-based dissimilarity. It is estimated experimentally by maximizing the intraclass similarity and minimizing the interclass similarity across the training data. A larger value of α corresponds to a dominant log-likelihood-based similarity where the need for the penalty mismatch is not significant. A smaller α corresponds to a more significant path mismatch penalty.
Using (9) to compute the similarity between all pairs of observations results in a similarity matrix that is not symmetric. Thus, we use the following three-step symmetrization scheme to transform it into a pairwise distance matrix:
$$ \left\{ \begin{array}{ll} 1.\quad \textbf{D}(i,j)=-\textbf{S}(i,j), \quad 1 \leq i,j \leq R \\ 2.\quad \textbf{D}(i,i)=0, \quad 1 \leq i \leq R \\ 3.\quad \textbf{D}(i,j)=\max(\textbf{D}(i,j),\textbf{D}(j,i)),\quad 1 \leq i,j\leq R. \end{array} \right. \tag{10} $$
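A minimal sketch of the similarity computation and the three-step symmetrization, written under two stated assumptions: the path-mismatch penalty is subtracted from the weighted log-likelihood, and the distance is obtained by negating the similarity, so that a higher similarity yields a smaller distance. The matrices `L` and `P` are taken as given nested lists, and the function name `distance_matrix` is hypothetical.

```python
def distance_matrix(L, P, alpha):
    """Combine log-likelihoods L and path penalties P into a symmetric distance matrix.

    L, P  : R x R nested lists (L[i][j] = log-likelihood, P[i][j] = edit distance)
    alpha : mixing factor in [0, 1]
    """
    R = len(L)

    # Similarity: weighted log-likelihood minus the weighted path penalty (assumed sign).
    S = [[alpha * L[i][j] - (1.0 - alpha) * P[i][j] for j in range(R)] for i in range(R)]

    # Step 1: turn the similarity into a dissimilarity (assumed negation).
    D = [[-S[i][j] for j in range(R)] for i in range(R)]

    # Step 2: a sequence is at distance zero from itself.
    for i in range(R):
        D[i][i] = 0.0

    # Step 3: symmetrize by keeping the larger of the two directed distances.
    return [[max(D[i][j], D[j][i]) for j in range(R)] for i in range(R)]


if __name__ == "__main__":
    L = [[-1.0, -5.0], [-3.0, -2.0]]
    P = [[0.0, 2.0], [1.0, 0.0]]
    print(distance_matrix(L, P, alpha=0.5))
```

Taking the maximum in step 3 is the conservative choice: two sequences are treated as close only if each one fits the other's model.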
4.2 Pairwise distance-based clustering
The distance matrix, computed using (10), reflects the degree to which pairs of sequences are considered similar. The largest variation is expected to be between sequences from different classes. Other significant variations may exist within the same class, e.g., the groups of signatures shown in Fig. 3. Our goal is to identify the similar groups so that one model can be learned for each group. This task can be achieved using any relational clustering algorithm. In our work, we use the standard agglomerative hierarchical algorithm [18].
Agglomerative hierarchical clustering is a bottom–up approach that starts with each data point as a cluster. It then proceeds by merging the most similar clusters to produce a sequence of clusters. Several measures have been used to assess the similarity between clusters [18]. Examples include single link, complete link, average link, and Ward distance. The complete link method tends to produce a large number of small and compact clusters, while the single link method is known to result in a few “elongated” clusters with a large number of points. A compromise between the two is the minimum-variance distance, or Ward distance [22]. This distance is defined as
$$ d(i,j)=\frac{n_{i}n_{j}}{n_{i}+n_{j}}\left\|\mathbf{c_{i}}-\mathbf{c_{j}}\right\|^{2} \tag{11} $$
where \(n_{k}\) and \(\mathbf{c_{k}}\) are the cardinality and the centroid of cluster \(C_{k}\), respectively. It has been shown in [17] that this approach merges the two clusters that lead to the smallest increase in the overall variance.
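The minimum-variance property can be checked numerically with a small Python sketch. It assumes points in a vector space so that centroids can be computed directly; when only a pairwise distance matrix is available, as in the relational setting used here, an implementation would typically rely on a distance-update recurrence instead, which is not shown. The helper names `centroid`, `ward_distance`, and `sse` are chosen for this example.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def ward_distance(cluster_i, cluster_j):
    """Ward distance: n_i n_j / (n_i + n_j) times the squared centroid distance."""
    ci, cj = centroid(cluster_i), centroid(cluster_j)
    sq_dist = sum((a - b) ** 2 for a, b in zip(ci, cj))
    ni, nj = len(cluster_i), len(cluster_j)
    return ni * nj / (ni + nj) * sq_dist


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


if __name__ == "__main__":
    a = [(0.0, 0.0), (2.0, 0.0)]
    b = [(5.0, 0.0)]
    # The Ward distance equals the increase in within-cluster SSE caused by merging.
    print(ward_distance(a, b), sse(a + b) - sse(a) - sse(b))
```

In the usage example, the two printed values agree, which illustrates that merging the pair with the smallest Ward distance is the merge that increases the total within-cluster scatter the least.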
4.3 Ensemble HMM initialization and training
The previous clustering step results in K clusters, each comprised of potentially similar sequences. Each cluster is then used to learn an HMM, resulting in an ensemble of K HMMs. Let \(N_{k}\) denote the number of sequences assigned to cluster k. Since our clustering step did not use class labels, clusters may include sequences from different classes. Let \(N_{k}^{(c)}\) be the number of sequences in cluster k that belong to class c, such that \(\sum _{c=1}^{C}{N_{k}^{(c)}}=N_{k}\). For instance, for the landmine example, if we let c=1 denote the class of mines and c=0 denote the class of clutter, \(N_{k}^{(1)}\) would be the number of mines assigned to cluster k.
The next step of our approach consists of learning a set of HMMs that reflect the diversity of the training data. Since a cluster contains a set of similar sequences, and each cluster may include observations from different classes, we learn one HMM model \(\left \{\lambda _{k}^{(c)}\right \}\) for each set of sequences assigned to class c within cluster k. Let \(\mathbb {O}_{k}^{(c)}=\left \{O_{r}^{(c)},y_{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be the set of sequences partitioned into cluster k that belong to class c and let \(\left \{\lambda _{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be their corresponding individual HMM models, c∈{1,⋯,C}.
For each cluster, we devise one of the following optimized training methods based on the cluster’s size and homogeneity.

Clusters dominated by sequences from only one class: In this case, we learn only one model for this cluster. The sequences within this cluster are presumably similar and belong to the same ground truth class, denoted \(C_{i}\). We assume that this cluster is a representative of that particular dominating class. It is expected that the class conditional posterior probability is unimodal and peaks around the MLE of the parameters. Thus, a maximum likelihood estimation would result in an HMM that best fits this particular class. For these reasons, we use the standard Baum-Welch re-estimation procedure [21]. Let \(K_{1}\) be the number of homogeneous clusters that fit into this category and let \(\left \{\lambda _{i}^{(C_{i})},i=1,\cdots,K_{1}\right \}\) denote the set of BW-trained models.

Clusters with a mixture of observations belonging to different classes: In this case, it is expected that the posterior distribution of the classes is multimodal. Thus, we need to learn one model for each class represented in this cluster. The MLE approach is not adequate, and more discriminative learning techniques such as genetic algorithms [23] or simulated annealing optimization [24] are needed to address the multimodality. In our work, we build a model for each class within the cluster. We focus on finding the class boundaries within the posteriors rather than trying to approximate a joint posterior probability. Thus, the models’ parameters are jointly optimized to minimize the overall misclassification error using a discriminative learning approach [25]. Let \(K_{2}\) be the number of mixed clusters that fit into this category and let \(\left \{\lambda _{j}^{(c)},j=1,\cdots,K_{2},c=1, \cdots,C\right \}\) be the set of MCE-trained models.

Clusters containing a small number of sequences: The MLE and MCE learning approaches need a large number of data points to give robust estimates of the model parameters. Thus, when a cluster has few samples, the above approaches may not be reliable. Ignoring these clusters is not a good option as they may contain information about sequences with distinctive characteristics. The Bayesian training framework [26], on the other hand, is suitable to learn model parameters using a small number of training sequences. Specifically, we select only the dominating class for this cluster and learn a single model using a variational Bayesian approach [26] to approximate the class conditional posterior distribution. Let \(K_{3}\) be the number of small clusters that fit into this category and let \(\left \{\lambda _{k}^{(C_{k})}, k=1,\cdots,K_{3}\right \}\) denote the set of Bayesian-trained models.
To summarize, for each homogeneous cluster i, we define one model \(\lambda _{i}^{(C_{i})}\), \(i=1,\cdots,K_{1}\), for the dominating class \(C_{i}\). For mixed clusters, we define C models per cluster: \(\lambda _{j}^{(c)}\), \(c=1,\cdots,C\), \(j=1,\cdots,K_{2}\). For each small cluster, we define one model \(\lambda _{k}^{(C_{k})}\) for the dominating class \(C_{k}\). The ensemble HMM mixture is defined as \(\left \{\lambda _{k}^{(c)}\right \}\), where \(k \in \{1,\cdots,K\}\), and \(c=C_{k}\) if cluster k is dominated by sequences labeled with class \(C_{k}\), and \(c \in \{1,\cdots,C\}\) if cluster k is a mixed cluster.
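The choice among the three training schemes can be sketched as a simple rule on each cluster's size and class purity. The text does not state numerical criteria for "small" or "dominated", so the thresholds `min_size` and `purity`, as well as the function name `choose_training_scheme`, are hypothetical values introduced only for this illustration.

```python
from collections import Counter


def choose_training_scheme(labels, min_size=5, purity=0.9):
    """Pick a training scheme for one cluster from the labels of its sequences.

    labels   : list of class labels of the sequences assigned to the cluster
    min_size : hypothetical size below which the cluster is treated as small
    purity   : hypothetical fraction above which one class is deemed dominant

    Returns (scheme, classes), where classes lists the classes to model.
    """
    counts = Counter(labels)
    dominant, n_dominant = counts.most_common(1)[0]

    if len(labels) < min_size:
        # Small cluster: one variational Bayesian model for the dominating class.
        return "VB", [dominant]
    if n_dominant / len(labels) >= purity:
        # Homogeneous cluster: one Baum-Welch model for the dominating class.
        return "BW", [dominant]
    # Mixed cluster: one MCE-trained model per class present in the cluster.
    return "MCE", sorted(counts)


if __name__ == "__main__":
    print(choose_training_scheme([1, 1, 0]))
    print(choose_training_scheme([1] * 19 + [0]))
    print(choose_training_scheme([1] * 6 + [0] * 4))
```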
For simplicity, we assume that all models \(\lambda _{k}^{(c)}\) have a fixed number of states N. For each model \(\lambda _{k}^{(c)}\), the initialization step consists of assigning the priors, the initial states transition probabilities, and the states parameters (initial means and initial emission probabilities) using observations \(O_{r}^{(c)}\) and their respective individual models \(\lambda _{r}^{(c)}\), \(r \in \left \{1, \cdots, N_{k}^{(c)}\right \}\). In particular, the initial values for the priors and the state transition probabilities are obtained by averaging, respectively, the priors and the state transition probabilities of the individual models \(\lambda _{r}^{(c)}, r \in \left \{1,\cdots,N_{k}^{(c)}\right \}\). The initialization of the emission probabilities in each state, \(b_{n}^{(k,c)}\), depends on whether the HMM is discrete or continuous.

Discrete HMM (DHMM): the state representatives and the codebook of model \(\lambda _{k}^{(c)}\) are obtained by partitioning and quantizing the observations \(\mathbb {O}_{k}^{(c)}\). First, sequences from cluster k that belong to class c, \(O_{r}^{(c)}\), are “unrolled” to form a vector of observations \(\textbf {U}^{(k,c)}\) of length \(N_{k}^{(c)}T\). The state representatives, \(s_{n}^{(k,c)}\), are obtained by clustering \(\textbf {U}^{(k,c)}\) into N clusters and taking the centroid of each cluster as the state representative. Similarly, the codebook \(\textbf {V}^{(k,c)}=\left [v_{1}^{(k,c)},\cdots,v_{M}^{(k,c)}\right ]\) is obtained by clustering \(\textbf {U}^{(k,c)}\) into M clusters. For each symbol \(v_{m}^{(k,c)}\), the membership in each state \(s_{n}^{(k,c)}\) is computed using
$$ b_{n}^{(k,c)}(m) = \frac{\frac{1}{\left\|v_{m}^{(k,c)}-s_{n}^{(k,c)}\right\|}}{\sum_{l=1}^{N}\frac{1}{\left\|v_{m}^{(k,c)}-s_{l}^{(k,c)} \right\|}}, \quad 1 \leq m \leq M. \tag{12} $$
To satisfy the requirement \(\sum _{m=1}^{M}b_{n}^{(k,c)}(m)=1\), we scale the values by:
$$ b_{n}^{(k,c)}(m) \longleftarrow \frac{b_{n}^{(k,c)}(m)}{\sum_{l=1}^{M}b_{n}^{(k,c)}(l)} \tag{13} $$

Continuous HMM (CHMM): we assume that each state has \(N_{g}\) Gaussian components. For each model \(\lambda _{k}^{(c)}\), as in the discrete case, we define a vector of observations, \(\textbf {U}^{(k,c)}\). First, \(\textbf {U}^{(k,c)}\) is partitioned into N clusters and the center of cluster n is taken as state \(s_{n}^{(k,c)}\). Let \(\textbf {U}_{n}^{(k,c)}\) be the observations assigned to cluster n. Next, we partition \(\textbf {U}_{n}^{(k,c)}\) into \(N_{g}\) clusters using the k-means algorithm [27]. The mean of each component, \(\mu _{n}^{(k,c,g)}\), is the center of one of the resulting clusters, and the covariance, \(\Sigma _{n}^{(k,c,g)}\), is estimated using the observations that belong to that same cluster. If we denote by \(\textbf {U}_{n}^{(k,c,g)}\) the observations that belong to component g of state \(s_{n}^{(k,c)}\), the parameters of \(\lambda _{k}^{(c)}\) are computed using
$$ \mu_{n}^{(k,c,g)} = \text{mean}\left\{\textbf{U}_{n}^{(k,c,g)}\right\}, \quad 1 \leq n \leq N, \ 1 \leq g \leq N_{g}, \tag{14} $$
$$ \Sigma_{n}^{(k,c,g)} = \text{covariance} \left\{\textbf{U}_{n}^{(k,c,g)}\right\}, \quad 1 \leq n \leq N, \ 1 \leq g \leq N_{g}. \tag{15} $$
For both the discrete and continuous cases, any clustering algorithm, such as the k-means [27] or the fuzzy c-means [28], could be used to identify the states, codebook, or the multiple components. After initialization, we use one of the training schemes described earlier to update the parameters of \(\lambda _{k}^{(c)}\) using the respective observations \(\mathbb {O}_{k}^{(c)}\), \(k \in \{1,\cdots,K\}\), \(c \in \{1,\cdots,C\}\). As mentioned earlier, for homogeneous clusters, BW training results in one model \(\lambda^{BW}\) per cluster; for mixed clusters, MCE training results in C models per cluster, \(\lambda ^{MCE}_{c}, c=1 \ldots C\); and for small clusters, variational Bayesian learning results in one model per cluster, \(\lambda^{VB}\). The output of Baum-Welch and VB-trained cluster models is \(\Pr(O \mid \lambda_{k})\) while the output of the MCE-trained cluster models is \(\max _{c}{\Pr\left (O \mid \lambda _{k,c}^{MCE}\right)}\).
4.4 Decision level fusion
The partial confidence values of the different models need to be combined into a single confidence value. Let \(\Lambda =\left \{\lambda ^{BW}_{i},\lambda ^{MCE}_{j},\lambda ^{VB}_{k}\right \}\) be the resulting mixture model composed of a total of K models, \(K=K_{1}+K_{2}+K_{3}\).
Let \(\textbf{F}(k,r)=\log \Pr(O_{r} \mid \lambda_{k})\), \(1 \leq r \leq R\), \(1 \leq k \leq K\), be the log-likelihood matrix obtained by testing the R training sequences with the K models. Each column \(\textbf{f}_{r}\) of matrix F represents the feature vector of each sequence in the decision space (recall that \(\textbf{f}_{r}\) is a K-dimensional vector while \(O_{r}\) is a sequence of vector observations of length T). In other words, each column represents the confidences assigned by the K models to each sequence r. Therefore, the set of sequences \(\mathbb {O}=\{O_{r},y_{r}\}_{r=1}^{R}\) is mapped to a confidence space \(\{\textbf {f}_{r}, y_{r}\}_{r=1}^{R}\). Finally, a combination function, \(\mathbb {H}\), takes all the \(\textbf{f}_{r}\)’s as input and outputs the final decision. The general framework for fusing the K outputs is highlighted in Algorithm ??.
Several decision level fusion techniques such as simple algebraic [29], artificial neural networks (ANN) [30], and hierarchical mixture of experts (HME) [31] can be used. In our work, we use an ANN with a single-layer perceptron and no hidden layers. The ANN weights are learned from the labeled training data using the backpropagation algorithm [32].
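A minimal sketch of decision-level fusion with a single-layer perceptron is given below for the two-class case. The text specifies only the network type and the use of backpropagation; the sigmoid output unit, the cross-entropy loss, the stochastic gradient updates, the hyperparameters `lr` and `epochs`, and the function names are assumptions made for this example. With no hidden layer, backpropagation reduces to a single gradient step on the output weights.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(f, w, b):
    """Map a K-dimensional confidence vector f to a single confidence in (0, 1)."""
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


def train_fusion(F, y, lr=0.1, epochs=500):
    """Learn fusion weights from confidence vectors F and binary labels y.

    F      : list of R confidence vectors, each of length K
    y      : list of R labels in {0, 1}
    lr     : hypothetical learning rate
    epochs : hypothetical number of passes over the training set
    """
    w, b = [0.0] * len(F[0]), 0.0
    for _ in range(epochs):
        for f, target in zip(F, y):
            # Gradient of the cross-entropy loss w.r.t. the pre-activation.
            err = fuse(f, w, b) - target
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


if __name__ == "__main__":
    # Toy log-likelihoods from K = 2 models; model 1 favors class 1, model 2 class 0.
    F = [[-1.0, -5.0], [-1.5, -4.0], [-5.0, -1.0], [-4.0, -1.5]]
    y = [1, 1, 0, 0]
    w, b = train_fusion(F, y)
    print([round(fuse(f, w, b), 3) for f in F])
```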
The architecture of the proposed eHMM is summarized in Fig. 1. It is composed of four main components: similarity matrix computation, relational clustering, adaptive training scheme, and decision level fusion. To test a new sequence, the outputs of the different models are aggregated into a single confidence value using Algorithm ??.