Ensemble hidden Markov models with application to landmine detection
EURASIP Journal on Advances in Signal Processing volume 2015, Article number: 75 (2015)
Abstract
We introduce an ensemble learning method for temporal data that uses a mixture of hidden Markov models (HMMs). We hypothesize that the data are generated by K models, each of which reflects a particular trend in the data. The proposed approach, called ensemble HMM (eHMM), is based on clustering within the log-likelihood space and has two main steps. First, one HMM is fit to each of the N individual training sequences. For each fitted model, we evaluate the log-likelihood of each sequence. This results in an N-by-N log-likelihood distance matrix that is partitioned into K groups using a relational clustering algorithm. In the second step, we learn the parameters of one HMM per cluster. We propose using and optimizing various training approaches for the different K groups depending on their size and homogeneity. In particular, we investigate the maximum likelihood (ML), the minimum classification error (MCE), and the variational Bayesian (VB) training approaches. Finally, to test a new sequence, its likelihood is computed in all the models and a final confidence value is assigned by combining the models’ outputs using an artificial neural network. We propose both discrete and continuous versions of the eHMM.
Our approach was evaluated on a real-world application for landmine detection using ground-penetrating radar (GPR). Results show that both the continuous and discrete eHMM can identify meaningful and coherent HMM mixture components that describe different properties of the data. Each HMM mixture component models a group of data that share common attributes. These attributes are reflected in the mixture model’s parameters. The results indicate that the proposed method outperforms the baseline HMM that uses one model for each class in the data.
Introduction
Detection and removal of buried landmines is a worldwide humanitarian and military problem. The latest statistics [1] show that in 2012, a total of 3618 casualties from mines were recorded in 62 countries, and the vast majority (78 %) of casualties were civilians. Detection and removal of landmines is therefore a significant problem that has attracted several researchers in recent years. One challenge in landmine detection lies in plastic or low-metal mines that are difficult to detect with traditional metal detectors. A variety of sensors have been proposed or are under investigation for landmine detection. Ground-penetrating radar (GPR) offers the promise of detecting landmines with little or no metal content. Unfortunately, landmine detection via GPR has proven to be a difficult problem [2, 3]. Although systems can achieve high detection rates, they have done so at the expense of high false alarm rates. The key challenge to mine detection technology lies in achieving a high rate of mine detection while maintaining a low false alarm rate. The performance of a mine detection system is therefore commonly measured by a receiver operating characteristic (ROC) curve that specifies the rate of true detection versus the rate of false alarm.
To improve the overall ROC of the landmine detection system, several algorithms have been introduced in the last decade. These algorithms use methods such as fuzzy logic [4], hidden Markov models [5–7], nearest neighbor classifiers [8, 9], support vector machines [10], or random forest [11] to assign a confidence that a mine is present at a point.
In [5, 6], hidden Markov modeling was proposed for detecting both metal and non-metal mine types using data collected by a moving vehicle-mounted GPR system. These initial applications have proved that HMM techniques are feasible and effective for landmine detection. The initial work relied on simple gradient edge features. Subsequent work used an edge histogram descriptor (EHD) approach to extract features from the original GPR signatures. The baseline HMM classifier consists of two HMM models, one for mine and one for background. The mine (background) model captures the characteristics of the mine (background) signatures. The model initialization and subsequent training are based on global averaging over the training data corresponding to each class.
Most subsequent published works in the area of landmine detection using HMMs focused on feature-level fusion [12] and/or model-level fusion [13–15]. All of these methods still use a single model for each class. In this paper, we argue that a single model is not sufficient to capture the intra-class variability. In the context of landmine detection, variations in the class of mines may be caused by the different mine types, burial depth, soil type, and moisture. Similarly, background signatures may exhibit large variations due to different soil conditions and data preprocessing techniques. To generalize the HMM approach, we identify the variations within each class in an unsupervised manner and use multiple models to account for the intra-class variations.
The proposed approach consists of the construction of a mixture of HMMs to cover the diversity of the training data. This approach, called ensemble of hidden Markov models (eHMM), has four main components: similarity matrix computation, relational clustering, adaptive training scheme, and decision level fusion. These components are summarized by the block diagram in Fig. 1 and will be described in Section 4.
The remainder of this paper is organized as follows. Section 2 provides background material on hidden Markov models. Section 3 highlights the motivations for adopting multiple models in our approach. Section 4 outlines the eHMM architecture and describes its different components. Section 5 reports the experimental results of our eHMM approach on large GPR collections and compares them to those of the baseline HMM detector. Finally, conclusions are provided in Section 6.
Background
Hidden Markov models
An HMM is a model of a doubly stochastic process that produces a sequence of random observation vectors at discrete times according to an underlying Markov chain. At each observation time, the Markov chain may be in one of N states \(\{s_{1},\ldots,s_{N}\}\) and, given that the chain is in a certain state, there are probabilities of moving to other states. These probabilities are called transition probabilities. Let T be the length of the observation sequence (i.e., number of time steps), let \(O=\{O_{1},\ldots,O_{T}\}\) be the observation sequence, and let \(Q=\{q_{1},\ldots,q_{T}\}\) be the state sequence. The compact notation
$$ \lambda = (A, B, \pi) \tag{1} $$
is generally used to indicate the complete parameter set of the HMM. In (1), \(A=[a_{ij}]\) is the state transition probability matrix, where \(a_{ij}=Pr(q_{t}=j \mid q_{t-1}=i)\) for \(i,j=1,\ldots,N\); \(\pi=[\pi_{i}]\), where \(\pi_{i}=Pr(q_{1}=s_{i})\), are the initial state probabilities; and \(B=\{b_{i}(O_{t}), i=1,\ldots,N\}\), where \(b_{i}(O_{t})=Pr(O_{t} \mid q_{t}=i)\), is the observation probability distribution in state i.
An HMM is called continuous if the observation probability density functions are continuous and discrete otherwise. In the case of the discrete HMM, the observation vectors are commonly quantized into a finite set of symbols, \(\{v_{1},v_{2},\ldots,v_{M}\}\), called the codebook. Each state is represented by a discrete probability density function and each symbol has an associated probability of occurring given that the system is in a given state. In other words, B becomes a simple set of fixed probabilities for each state. That is, \(b_{i}(O_{t})=b_{i}(k)=Pr(v_{k} \mid q_{t}=i)\), where \(v_{k}\) is the nearest codebook symbol to \(O_{t}\).
Given the form of the hidden Markov model defined in (1), Rabiner [16] defines three key problems of interest that must be solved for the model to be useful in real-world applications: (i) the classification problem; (ii) the problem of finding an optimal state sequence; and (iii) the problem of estimating the model parameters.
The classification problem involves computing the probability of an observation sequence \(O=\{O_{1},O_{2},\ldots,O_{T}\}\) given a model λ, i.e., \(Pr(O \mid \lambda)\). This probability is computed efficiently using the forward–backward procedure [16].
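As an illustration only, the forward pass of this procedure can be sketched for a discrete HMM as follows. The function name `forward_loglik`, the list-based parameter layout, and the per-step rescaling are choices made for this sketch and are not taken from the paper.

```python
import math

def forward_loglik(obs, pi, A, B):
    """Return log Pr(O | lambda) for a discrete HMM using a scaled forward pass.

    obs   : list of symbol indices (one per time step)
    pi[i] : initial probability of state i
    A[i][j]: probability of moving from state i to state j
    B[i][k]: probability of emitting symbol k in state i
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        # Rescale to avoid underflow; the log of the scales sums to log Pr(O | lambda).
        alpha = [a / scale for a in alpha]
        loglik += math.log(scale)
    return loglik
```

Working in the log domain keeps long sequences from underflowing, which matters when the resulting values are later compared across many models.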
In many applications, computing an optimal state sequence is more useful than computing \(Pr(O \mid \lambda)\). There are several possible optimality criteria. One that is particularly useful is to maximize \(Pr(O,Q \mid \lambda)\) over all possible state sequences Q. The Viterbi algorithm [17] is an efficient and formal technique for finding this optimal state sequence and its probability.
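A minimal sketch of the Viterbi recursion for a discrete HMM is given here for illustration. The name `viterbi`, the zero-based state indices, and the log-domain arithmetic are assumptions of this sketch rather than details from the paper.

```python
import math

def viterbi(obs, pi, A, B):
    """Return (best_path, log_prob) maximizing Pr(O, Q | lambda) over state paths Q."""
    n = len(pi)

    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    # delta[i] is the best log-probability of any path ending in state i.
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for t in range(1, len(obs)):
        prev, delta, ptr = delta, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][obs[t]]))
        back.append(ptr)

    # Backtrack from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    best_logp = delta[state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_logp
```

The returned path is the quantity that the path-based comparisons later in the paper operate on.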
The third problem in building an HMM is the training problem: how does one estimate the parameters of the model? The problem is difficult because there are several levels of estimation required in an HMM. First, the states themselves must be estimated. Then, the model parameters λ=(A,B,π) need to be estimated. For the discrete HMM, first the codebook is determined, usually using the K-means algorithm [18], or other vector quantization algorithms. Then, the parameters (A,B,π) are estimated iteratively using the Baum-Welch learning algorithm [19].
Baseline HMM classifier for landmine detection
The baseline HMM classifier for GPR-based landmine detection was first introduced in [5]. It consists of two HMM models, one for mines and one for the background. Each model has four states and produces a probability value by backtracking through model states using the Viterbi algorithm [17]. The mine model, \(\lambda^{m}\), is designed to capture the spatial distribution of the features. This model assumes that mine signatures have a hyperbolic shape comprised of a succession of rising, horizontal, and falling edges with variable duration in each state. The beginning and the end of the observation vectors correspond typically to a non-edge (or background) state. The background model, \(\lambda^{b}\), is needed to capture the background and clutter characteristics. No prior information or assumptions are used for this model.
The architecture of the baseline HMM classifier is shown in Fig. 2. Full details of the model’s initialization and training can be found in [5]. The probability value produced by the mine (background) model can be thought of as an estimate of the probability of the observation sequence given that there is a mine (background) present.
The confidence value assigned to each observation sequence, Conf(O), depends on: (1) the probability assigned by the mine model, \(Pr(O \mid \lambda^{m})\); (2) the probability assigned by the background model, \(Pr(O \mid \lambda^{b})\); and (3) the optimal state sequence. Thus,
where \(\#\{s_{t}=1, t=1,\cdots,T\}\) corresponds to the number of observations assigned to the background state (state 1). \(T_{\max}\) is defined experimentally based on the shortest mine signature. Equation (2) ensures that sequences with a large number of observations assigned to state 1 are considered non-mines.
Extensions to the baseline HMM for landmine detection
In an effort to improve performance and generalization, several extensions to the baseline HMM have been proposed. For instance, in [12, 13], the authors proposed the multi-stream HMM (MSHMM) that combines multiple sets of features. An optimal weight for each feature was learned in the training phase. In [14], maximum likelihood (ML) and minimization of classification error (MCE) learning methods were derived for the MSHMM. In [15], HMMs with stick-breaking priors (SBHMM) [20] were employed to learn the number of HMM states in the baseline HMM landmine detector. This approach relies on a variational Bayesian learning technique in lieu of standard Baum-Welch (BW) training.
Motivations
The baseline HMM represents each class by a single model learned from all the observations within that class. The goal is to generalize from all the training data in order to classify unseen observations. However, for complex classification problems with large intra-class variations, combining observations with different characteristics to learn one model might lead to too much averaging and thus to the loss of the discriminating characteristics of the observations.
To illustrate this problem, we use the example of detecting buried landmines using GPR sensors^{1}. In this case, the training data consists of a set of N GPR alarms labeled as mines (class 1) or clutter (class 0). The goal is to generalize from the training data in order to classify unlabeled GPR signatures. In Fig. 3, we show three groups of mines with different signature strengths. Grouping all of these signatures to learn a single model would likely lead to poor generalization. Similarly, the false alarms could have significant variations as they are caused by different clutter objects and varied environmental conditions. These issues are more acute when data are collected by multiple sensors and/or using various features.
Consequently, learning a set of models that reflect different characteristics of the observations might be more beneficial than using one global model for each class. This is typical in many classifiers such as the KNN [9], which uses different prototypes for each class, and the SVM [10], which uses multiple support vectors.
In this paper, we develop a new approach that replaces the two-model classifier with one that includes multiple models for each class. For instance, each group of signatures in Fig. 3 would be used to learn a different model. Our approach aims to capture the characteristics of the observations that would be lost under averaging in the two-model case.
We hypothesize that under realistic conditions, the data are generated by multiple models. The proposed approach, called ensemble HMM (eHMM), attempts to partition the training data within the log-likelihood space and identify multiple clusters in an unsupervised manner. Depending on each cluster’s homogeneity and size, an appropriate training scheme is applied to learn the corresponding HMM parameters. The resulting K HMMs are then aggregated through a decision level fusion component to form a descriptive model for the data.
Ensemble HMM architecture
Let \(\mathbb {O}=\left \{O_{r},y_{r}\right \}_{r=1}^{R}\) be a set of R labeled sequences of length T where \(O_{r}=\left \{O_{r}^{(1)},\cdots,O_{r}^{(T)}\right \}\) and y _{ r }∈{1,⋯,C} is the label (class) of sequence O _{ r }. First, we need to identify subgroups of observations that have common patterns. Ground truth information could not be used for this task as it is insufficient and unreliable. For instance, a large deep buried mine can have a signature similar to a small shallow buried mine. Furthermore, the same mine buried at the same depth in soil with different properties may have different signatures. Thus, the partitioning needs to be done in an unsupervised way, i.e., regardless of the observation’s labels and the limited ground truth information. In our approach, we use unsupervised learning to cluster the set of all observations, \(\mathbb {O}\), into subgroups of “similar” observations. The first step in this approach is to define a measure of similarity between two observations.
Similarity between observations in the loglikelihood space
4.1.1 Fitting individual models to sequences
Initially, each sequence in the training data, O _{ r }, 1≤r≤R is used to learn an HMM model λ _{ r }. Even though using only one sequence of observations to learn an HMM might lead to overfitting, this technique is only an intermediate step that aims to capture the characteristics of each sequence. The produced HMM model is meant to give a maximal description of each sequence, and therefore, overfitting is not an issue in this context. In fact, it is desired that the model perfectly fits the observation sequence. In this case, the likelihood of each sequence with respect to its corresponding model is expected to be higher than those with respect to the remaining models.
Let \(\left \{\lambda _{r}^{(0)}\right \}_{r=1}^{R}\) be the set of initial models and let \(s_{n}^{(r)}, \ 1 \leq n \leq N\), be the representative of each state in \(\lambda _{r}^{(0)}\). Each model has N states. First, the model states can be assigned to the sequence observations either heuristically, using domain knowledge, or automatically by clustering the sequence observations into N clusters. In our approach, we use the latter and we define the states’ means and observations as the center and elements of each resulting cluster, respectively. Consequently, the transition matrix and the initial probabilities of \(\lambda _{r}^{(0)}\) are set according to the aforementioned associations. For the emission probabilities, the initialization differs whether we use the discrete or continuous HMM.
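For the discrete case, the codewords \(\{v_{1},\cdots,v_{M}\}\) of the initial individual DHMM model are the actual observations of the sequence \(\{O_{1},\cdots,O_{T}\}\). The emission probability of each codeword in each state is inversely proportional to its distance to the mean of that state. We use
$$ b_{n}(m) = \frac{\frac{1}{\left\| v_{m} - s_{n}^{(r)} \right\|}}{\sum_{l=1}^{N}\frac{1}{\left\| v_{m} - s_{l}^{(r)} \right\|}}, \quad 1 \leq m \leq M. \tag{3} $$
To satisfy the requirement that \(\sum _{m=1}^{M}b_{n}(m)=1\), we normalize the values using
$$ b_{n}(m) \longleftarrow \frac{b_{n}(m)}{\sum_{l=1}^{M}b_{n}(l)}. \tag{4} $$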
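In the continuous case, the emission probability density functions are modeled by mixtures of Gaussians. In the case of individual sequence models, as the number of observations is small, we use a single component mixture for each state. Thus, the observations belonging to each state are used to estimate the mean and covariance of that state’s component. We use
$$ \mu_{n} = \text{mean}\left\{O_{t} : O_{t} \text{ assigned to state } s_{n}^{(r)}\right\}, \quad 1 \leq n \leq N, \tag{5} $$
and
$$ \Sigma_{n} = \text{covariance}\left\{O_{t} : O_{t} \text{ assigned to state } s_{n}^{(r)}\right\}, \quad 1 \leq n \leq N. \tag{6} $$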
Then, the Baum-Welch algorithm [21] is used to adapt the model parameters to each given observation. Let \(\{\lambda _{r}\}_{r=1}^{R}\) be the set of trained individual models.
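To make the initialization step more concrete, the following sketch shows one way the initial state probabilities and the transition matrix of an individual model could be derived once each observation of a sequence has been assigned to a state. The counting rule, the `smoothing` constant, and the name `init_from_labels` are assumptions of this sketch; the paper only states that these quantities are set according to the state assignments.

```python
def init_from_labels(labels, n_states, smoothing=1e-3):
    """Estimate (pi, A) from the per-observation state labels of one sequence.

    labels   : state index (0-based) assigned to each observation, in time order
    n_states : number of states N
    smoothing: small pseudo-count (hypothetical) so that no probability is exactly zero
    """
    # Count observed transitions between consecutive state labels.
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for i, j in zip(labels, labels[1:]):
        counts[i][j] += 1.0
    # Normalize each row into a probability distribution.
    A = [[c / sum(row) for c in row] for row in counts]

    # With a single sequence, the initial distribution concentrates on its first state.
    pi = [smoothing] * n_states
    pi[labels[0]] += 1.0
    total = sum(pi)
    pi = [p / total for p in pi]
    return pi, A
```

The pseudo-count keeps unseen transitions from being locked at zero during the subsequent Baum-Welch updates, which would otherwise never be able to revive them.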
Next, we need to define a measure that evaluates the similarity between pairs of observation sequences. While similarity between static data observations is straightforward and well defined, defining a similarity between observation sequences is more of a challenge. Within the context of HMM modeling, we consider two observation sequences similar if: (i) they fit each other’s models; and (ii) they have similar Viterbi optimal paths [17].
4.1.2 Log-likelihood-based similarity
The log-likelihood, L(i,j), of sequence \(O_{i}\) being generated from model \(\lambda_{j}\) reflects the degree to which \(O_{i}\) fits \(\lambda_{j}\) and is defined as:
$$ L(i,j) = \log Pr(O_{i} \mid \lambda_{j}), \quad 1 \leq i,j \leq R. \tag{7} $$
In (7), L can be computed using the forward–backward procedure mentioned in Section 2.1. When the log-likelihood value is high, it is likely that model \(\lambda_{j}\) generated sequence \(O_{i}\). In this case, sequences \(O_{i}\) and \(O_{j}\) are expected to have common salient features and are considered to be similar. On the other hand, when the likelihood term is low, it is unlikely that model \(\lambda_{j}\) generated the sequence \(O_{i}\). In this case, \(O_{i}\) and \(O_{j}\) are considered to be dissimilar. For each observation sequence \(O_{r}\), 1≤r≤R, we compute its likelihood in each model \(\lambda_{p}\), \(Pr(O_{r} \mid \lambda_{p})\), for 1≤p≤R. This results in an R×R log-likelihood matrix.
4.1.3 Path-mismatch-based penalty
The likelihood-based similarity may not always be accurate. In fact, some observations can have a high likelihood in a visually different model. This occurs when most of the elements of a sequence partially match only one or two of the states of the model. In this case, the observation sequence can have a high likelihood in the model but its optimal Viterbi path will deviate from the typical path. To alleviate this problem, we introduce a penalty term, P(i,j), to the log-likelihood measure that is related to the mismatch between the most likely sequence of hidden states of the test sequence (\(O_{i}\)) and that of the generating sequence (\(O_{j}\)), i.e.,
$$ P(i,j) = D_{Edit}\left(Q^{(ji)}, Q^{(jj)}\right). \tag{8} $$
In (8), P(i,j) is the distance between the Viterbi optimal path, \(Q^{(ji)}\), of testing sequence \(O_{i}\) with model \(\lambda_{j}\), and the Viterbi optimal path of testing sequence \(O_{j}\) with model \(\lambda_{j}\), \(Q^{(jj)}\). In (8), \(D_{Edit}\) is the “edit distance” [17], commonly used in string comparisons. The “edit distance” between two strings, say p and q, is defined as the minimum number of single-character edit operations (deletions, insertions, and/or replacements) that would convert p into q. The Viterbi path mismatch term is intended to ensure that similar sequences have few mismatches in their corresponding Viterbi optimal paths. Since the Viterbi path is already available when using the forward–backward procedure for the likelihood computation, the penalty term does not require significant additional computation.
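The edit distance described above can be computed with the standard dynamic program. The sketch below is illustrative; the name `edit_distance` is chosen here, and it accepts any two sequences (strings or lists of state indices such as Viterbi paths).

```python
def edit_distance(p, q):
    """Minimum number of insertions, deletions, and replacements turning p into q."""
    # prev[j] holds the distance between the processed prefix of p and q[:j].
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]  # converting p[:i] into the empty sequence takes i deletions
        for j, b in enumerate(q, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # replace a by b, or match
        prev = curr
    return prev[-1]
```

For example, the state paths `[1, 1, 2, 3, 4, 1]` and `[1, 2, 2, 3, 4, 1]` differ by a single replacement, so their distance is 1.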
Finally, we define the similarity, S, between two sequences \(O_{i}\) and \(O_{j}\) by combining (7) and (8):
$$ S(i,j) = \alpha L(i,j) - (1-\alpha) P(i,j). \tag{9} $$
In (9), the mixing factor, α∈[0,1], is a trade-off parameter between the log-likelihood-based similarity and the Viterbi-path-mismatch-based dissimilarity. It is estimated experimentally by maximizing the intra-class similarity and minimizing the inter-class similarity across the training data. A larger value of α corresponds to a dominant log-likelihood-based similarity where the need for the penalty mismatch is not significant. A smaller α corresponds to a more significant path mismatch penalty.
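As a rough sketch of how the two terms could be combined, the function below assumes that the similarity is the weighted difference α·L(i,j) − (1−α)·P(i,j), which matches the described roles of α but is an assumption about the exact form. The name `similarity_matrix` and the nested-list layout are also choices of this sketch.

```python
def similarity_matrix(L, P, alpha):
    """Combine log-likelihoods L and path penalties P into similarities S.

    Assumed form: S[i][j] = alpha * L[i][j] - (1 - alpha) * P[i][j].
    L and P are R-by-R nested lists; alpha lies in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[alpha * l - (1.0 - alpha) * p for l, p in zip(l_row, p_row)]
            for l_row, p_row in zip(L, P)]
```

Because log-likelihoods and edit distances live on different scales, in practice both matrices would probably need to be normalized before mixing; that step is not shown here.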
Using (9) to compute the similarity between all pairs of observations results in a similarity matrix that is not symmetric. Thus, we use the following three-step symmetrization scheme to transform it into a pairwise distance matrix:
Pairwise distance-based clustering
The distance matrix, computed using (10), reflects the degree to which pairs of sequences are considered similar. The largest variation is expected to be between sequences from different classes. Other significant variations may exist within the same class, e.g., the groups of signatures shown in Fig. 3. Our goal is to identify the similar groups so that one model can be learned for each group. This task can be achieved using any relational clustering algorithm. In our work, we use the standard agglomerative hierarchical algorithm [18].
Agglomerative hierarchical clustering is a bottom–up approach that starts with each data point as a cluster. It then proceeds by merging the most similar clusters to produce a sequence of clusters. Several measures have been used to assess the similarity between clusters [18]. Examples include single link, complete link, average link, and Ward distance. The complete link method tends to produce a large number of small and compact clusters, while the single link method is known to result in a few “elongated” clusters with a large number of points. A compromise between the two is the minimum-variance distance, or Ward distance [22]. This distance is defined as
where \(n_{k}\) and \(c_{k}\) are the cardinality and the centroid of cluster \(C_{k}\), respectively. It has been shown in [17] that this approach merges the two clusters that lead to the smallest increase in the overall variance.
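For illustration, a commonly used form of the Ward criterion is \(\frac{n_{i} n_{j}}{n_{i}+n_{j}}\left\| c_{i}-c_{j}\right\|^{2}\), which equals the increase in the within-cluster sum of squares caused by merging the two clusters. The sketch below implements that common form for clusters given as lists of feature vectors; whether the paper uses exactly this normalization is an assumption, and `ward_distance` is a name chosen here.

```python
def ward_distance(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters were merged.

    Each cluster is a non-empty list of equal-length numeric vectors.
    """
    n_a, n_b = len(cluster_a), len(cluster_b)
    # Centroids are the coordinate-wise means of the member vectors.
    c_a = [sum(col) / n_a for col in zip(*cluster_a)]
    c_b = [sum(col) / n_b for col in zip(*cluster_b)]
    squared_gap = sum((x - y) ** 2 for x, y in zip(c_a, c_b))
    return n_a * n_b / (n_a + n_b) * squared_gap
```

With clusters `[[0, 0], [2, 0]]` and `[[4, 0]]`, the centroids are 3 units apart and the merge cost is (2·1/3)·9 = 6, which is the amount by which the total squared deviation grows after merging.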
Ensemble HMM initialization and training
The previous clustering step results in K clusters, each comprised of potentially similar sequences. Each cluster is then used to learn an HMM, resulting in an ensemble of K HMMs. Let N _{ k } denote the number of sequences assigned to the same cluster k. Since our clustering step did not use class labels, clusters may include sequences from different classes. Let \(N_{k}^{(c)}\) be the number of sequences in cluster k that belong to class c, such that \(\sum _{c=1}^{C}{N_{k}^{(c)}}=N_{k}\). For instance, for the landmine example, if we let c=1 denote the class of mines and c=0 denote the class of clutter, \(N_{k}^{(1)}\) would be the number of mines assigned to cluster k.
The next step of our approach consists of learning a set of HMMs that reflect the diversity of the training data. Since a cluster contains a set of similar sequences, and each cluster may include observations from different classes, we learn one HMM model \(\left \{\lambda _{k}^{(c)}\right \}\) for each set of sequences assigned to class c within cluster k. Let \(\mathbb {O}_{k}^{(c)}=\left \{O_{r}^{(c)},y_{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be the set of sequences partitioned into cluster k that belong to class c and let \(\left \{\lambda _{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be their corresponding individual HMM models, c∈{1,⋯,C}.
For each cluster, we devise one of the following optimized training methods based on the cluster’s size and homogeneity.

Clusters dominated by sequences from only one class: In this case, we learn only one model for this cluster. The sequences within this cluster are presumably similar and belong to the same ground truth class, denoted \(C_{i}\). We assume that this cluster is a representative of that particular dominating class. It is expected that the class conditional posterior probability is unimodal and peaks around the MLE of the parameters. Thus, a maximum likelihood estimation would result in an HMM that best fits this particular class. For these reasons, we use the standard Baum-Welch re-estimation procedure [21]. Let \(K_{1}\) be the number of homogeneous clusters that fit into this category and let \(\left \{\lambda _{i}^{(C_{i})},i=1,\cdots,K_{1}\right \}\) denote the set of BW-trained models.

Clusters with a mixture of observations belonging to different classes: In this case, it is expected that the posterior distribution of the classes is multimodal. Thus, we need to learn one model for each class represented in this cluster. The MLE approach is not adequate, and more discriminative learning techniques such as genetic algorithms [23] or simulated annealing optimization [24] are needed to address the multimodality. In our work, we build a model for each class within the cluster. We focus on finding the class boundaries within the posteriors rather than trying to approximate a joint posterior probability. Thus, the models’ parameters are jointly optimized to minimize the overall misclassification error using a discriminative learning approach [25]. Let \(K_{2}\) be the number of mixed clusters that fit into this category and let \(\left \{\lambda _{j}^{(c)},j=1,\cdots,K_{2},c=1, \cdots,C\right \}\) be the set of MCE-trained models.

Clusters containing a small number of sequences: The MLE and MCE learning approaches need a large number of data points to give robust estimates of the model parameters. Thus, when a cluster has few samples, the above approaches may not be reliable. Ignoring these clusters is not a good option as they may contain information about sequences with distinctive characteristics. The Bayesian training framework [26], on the other hand, is suitable to learn model parameters using a small number of training sequences. Specifically, we select only the dominating class for this cluster and learn a single model using a variational Bayesian approach [26] to approximate the class conditional posterior distribution. Let \(K_{3}\) be the number of small clusters that fit into this category and let \(\left \{\lambda _{k}^{(C_{k})}, k=1,\cdots,K_{3}\right \}\) denote the set of Bayesian-trained models.
To summarize, for each homogeneous cluster i, we define one model \(\lambda _{i}^{(C_{i})}\), \(i=1,\cdots,K_{1}\), for the dominating class \(C_{i}\). For mixed clusters, we define C models per cluster: \(\lambda _{j}^{(c)}\), \(c=1,\cdots,C\), \(j=1,\cdots,K_{2}\). For each small cluster, we define one model \(\lambda _{k}^{(C_{k})}\) for the dominating class \(C_{k}\). The ensemble HMM mixture is defined as \(\left \{\lambda _{k}^{(c)}\right \}\), where \(k\in\{1,\cdots,K\}\), and \(c=C_{k}\) if cluster k is dominated by sequences labeled with class \(C_{k}\), and \(c\in\{1,\cdots,C\}\) if cluster k is a mixed cluster.
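The three cases above amount to a per-cluster decision rule. The sketch below shows one possible form of that rule; the thresholds `min_size` and `purity_threshold`, their default values, and the name `choose_training_scheme` are hypothetical, since the paper does not state how "small" or "dominated" are quantified at this point.

```python
from collections import Counter

def choose_training_scheme(labels, min_size=10, purity_threshold=0.9):
    """Pick a training scheme for one cluster from the class labels of its sequences.

    Returns (scheme, dominant_class):
      ("VB", c)     small cluster, one variational Bayesian model for class c
      ("BW", c)     homogeneous cluster, one Baum-Welch model for class c
      ("MCE", None) mixed cluster, one discriminatively trained model per class
    """
    if not labels:
        raise ValueError("cluster must contain at least one sequence")
    dominant, dominant_count = Counter(labels).most_common(1)[0]
    if len(labels) < min_size:
        return "VB", dominant
    if dominant_count / len(labels) >= purity_threshold:
        return "BW", dominant
    return "MCE", None
```

Checking the size first means that a small cluster is always handled by the Bayesian scheme, even when it is also homogeneous, which mirrors the motivation that ML and MCE estimates are unreliable with few samples.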
For simplicity, we assume that all models \(\lambda _{k}^{(c)}\) have a fixed number of states N. For each model \(\lambda _{k}^{(c)}\), the initialization step consists of assigning the priors, the initial states transition probabilities, and the states parameters (initial means and initial emission probabilities) using observations \(O_{r}^{(c)}\) and their respective individual models \(\lambda _{r}^{(c)}\), \(r \in \left \{1, \cdots, N_{k}^{(c)}\right \}\). In particular, the initial values for the priors and the state transition probabilities are obtained by averaging, respectively, the priors and the state transition probabilities of the individual models \(\lambda _{r}^{(c)}, r \in \left \{1,\cdots,N_{k}^{(c)}\right \}\). The initialization of the emission probabilities in each state, \(b_{n}^{(k,c)}\), depends on whether the HMM is discrete or continuous.

Discrete HMM (DHMM): the state representatives and the codebook of model \(\lambda _{k}^{(c)}\) are obtained by partitioning and quantizing the observations \(\mathbb {O}_{k}^{(c)}\). First, sequences from cluster k that belong to class c, \(O_{r}^{(c)}\), are “unrolled” to form a vector of observations \(\textbf{U}^{(k,c)}\) of length \(N_{k}^{(c)}T\). The state representatives, \(s_{n}^{(k,c)}\), are obtained by clustering \(\textbf{U}^{(k,c)}\) into N clusters and taking the centroid of each cluster as the state representative. Similarly, the codebook \(\textbf {V}^{(k,c)}=\left [v_{1}^{(k,c)},\cdots,v_{M}^{(k,c)}\right ]\) is obtained by clustering \(\textbf{U}^{(k,c)}\) into M clusters. For each symbol \(v_{m}^{(k,c)}\), the membership in each state \(s_{n}^{(k,c)}\) is computed using
$$ b_{n}^{(k,c)}(m) = \frac{\frac{1}{\left\| v_{m}^{(k,c)} - s_{n}^{(k,c)} \right\|}}{\sum_{l=1}^{N}\frac{1}{\left\| v_{m}^{(k,c)} - s_{l}^{(k,c)} \right\|}}, \quad 1 \leq m \leq M. \tag{12} $$
To satisfy the requirement \(\sum _{m=1}^{M}b_{n}^{(k,c)}(m)=1\), we scale the values by:
$$ b_{n}^{(k,c)}(m) \longleftarrow \frac{b_{n}^{(k,c)}(m)}{\sum_{l=1}^{M}b_{n}^{(k,c)}(l)} \tag{13} $$
Continuous HMM (CHMM): we assume that each state has \(N_{g}\) Gaussian components. For each model \(\lambda _{k}^{(c)}\), as in the discrete case, we define a vector of observations, \(\textbf{U}^{(k,c)}\). First, \(\textbf{U}^{(k,c)}\) is partitioned into N clusters and the center of cluster n is taken as state \(s_{n}^{(k,c)}\). Let \(\textbf {U}_{n}^{(k,c)}\) be the observations assigned to cluster n. Next, we partition \(\textbf {U}_{n}^{(k,c)}\) into \(N_{g}\) clusters using the k-means algorithm [27]. The mean of each component, \(\mu _{n}^{(k,c,g)}\), is the center of one of the resulting clusters, and the covariance, \(\Sigma _{n}^{(k,c,g)}\), is estimated using the observations that belong to that same cluster. If we denote by \(\textbf {U}_{n}^{(k,c,g)}\) the observations that belong to component g of state \(s_{n}^{(k,c)}\), the parameters of \(\lambda _{k}^{(c)}\) are computed using
$$ \mu_{n}^{(k,c,g)} = \text{mean}\left\{\textbf{U}_{n}^{(k,c,g)}\right\}, \quad 1 \leq n \leq N, \ 1 \leq g \leq N_{g}. \tag{14} $$
$$ \Sigma_{n}^{(k,c,g)} = \text{covariance} \left\{\textbf{U}_{n}^{(k,c,g)}\right\}, \quad 1 \leq n \leq N, \ 1 \leq g \leq N_{g}. \tag{15} $$
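The discrete initialization in (12) and (13) can be sketched as follows, assuming Euclidean distance for the norm. The function name `init_emissions` and the small constant `eps`, which guards against a codeword that coincides with a state representative, are additions made for this sketch.

```python
import math

def init_emissions(codebook, state_reps, eps=1e-12):
    """Return b[n][m], the initial probability of symbol m in state n.

    codebook  : list of M codeword vectors v_m
    state_reps: list of N state representative vectors s_n
    eps       : hypothetical guard against division by zero
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    n_states, n_symbols = len(state_reps), len(codebook)
    # inv[n][m] = 1 / ||v_m - s_n||
    inv = [[1.0 / (dist(v, s) + eps) for v in codebook] for s in state_reps]
    # (12): membership of each symbol in each state, normalized across states.
    col_sums = [sum(inv[n][m] for n in range(n_states)) for m in range(n_symbols)]
    b = [[inv[n][m] / col_sums[m] for m in range(n_symbols)] for n in range(n_states)]
    # (13): rescale each state's memberships so they sum to one over the symbols.
    return [[x / sum(row) for x in row] for row in b]
```

Symbols that lie close to a state representative receive most of that state's emission mass, which gives the subsequent training a starting point that already reflects the state partition.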
For both the discrete and continuous cases, any clustering algorithm, such as the K-means [27] or the fuzzy c-means [28], could be used to identify the states, codebook, or the multiple components. After initialization, we use one of the training schemes described earlier to update the parameters of \(\lambda _{k}^{(c)}\) using the respective observations \(\mathbb {O}_{k}^{(c)}\), \(k\in\{1,\cdots,K\}\), \(c\in\{1,\cdots,C\}\). As mentioned earlier, for homogeneous clusters, BW training results in one model \(\lambda^{BW}\) per cluster; for mixed clusters, MCE training results in C models per cluster, \(\lambda ^{MCE}_{c}, c=1, \ldots, C\); and for small clusters, variational Bayesian learning results in one model per cluster, \(\lambda^{VB}\). The output of Baum-Welch and VB-trained cluster models is \(Pr(O \mid \lambda_{k})\) while the output of the MCE-trained cluster models is \(\max _{c}{Pr\left (O \mid \lambda _{k,c}^{MCE}\right)}\).
Decision level fusion
The partial confidence values of the different models need to be combined into a single confidence value. Let \(\Lambda =\left \{\lambda ^{BW}_{i},\lambda ^{MCE}_{j},\lambda ^{VB}_{k}\right \}\) be the resulting mixture model composed of a total of K models, K=K _{1}+K _{2}+K _{3}.
Let F(k,r)= logP r(O _{ r }λ _{ k }),1≤r≤R,1≤k≤K, be the loglikelihood matrix obtained by testing the R training sequences with the K models. Each column f _{ r } of matrix F represents the feature vector of each sequence in the decision space (recall that f _{ r } is a Kdimensional vector while O _{ r } is a sequence of vector observations of length T). In other words, each column represents the confidences assigned by the K models to each sequence r. Therefore, the set of sequences \(\mathbb {O}=\{O_{r},y_{r}\}_{r=1}^{R}\) is mapped to a confidence space \(\{\textbf {f}_{r}, y_{r}\}_{r=1}^{R}\). Finally, a combination function, \(\mathbb {H}\), takes all the f _{ r }’s as input and outputs the final decision. The general framework for fusing the K outputs is highlighted in Algorithm ??.
Several decision level fusion techniques, such as simple algebraic combination [29], artificial neural networks (ANN) [30], and hierarchical mixture of experts (HME) [31], can be used. In our work, we use a single-layer perceptron, i.e., an ANN with no hidden layers. The ANN weights are learned from the labeled training data using the backpropagation algorithm [32].
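A network with no hidden layers reduces to a single unit acting on the K-dimensional confidence vector. The sketch below assumes a sigmoid output and a cross-entropy loss minimized by batch gradient descent; the text does not specify the activation, the loss, or the learning rate, so these choices, the function names, and the synthetic confidences are all assumptions.

```python
import numpy as np

def train_fusion_layer(features, labels, lr=0.01, epochs=1000):
    """Fit the weights w and bias b of one sigmoid unit by batch gradient descent.

    features: (R, K) array, row r is the confidence vector f_r
    labels:   (R,) array with 1 for mine and 0 for clutter
    """
    n_samples, n_models = features.shape
    w = np.zeros(n_models)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - labels                         # d(cross-entropy)/d(pre-activation)
        w -= lr * (features.T @ grad) / n_samples
        b -= lr * grad.mean()
    return w, b

def fuse(features, w, b):
    """Final confidence in [0, 1] for each row of features."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

# Synthetic log-likelihoods (assumption): model 0 favors mines, model 1 favors clutter.
rng = np.random.default_rng(0)
mines = rng.normal([-5.0, -12.0], 0.5, size=(40, 2))
clutter = rng.normal([-12.0, -5.0], 0.5, size=(40, 2))
features = np.vstack([mines, clutter])
labels = np.concatenate([np.ones(40), np.zeros(40)])
w, b = train_fusion_layer(features, labels)
confidence = fuse(features, w, b)
```

With this toy data, the learned weight is positive for the mine-favoring model and negative for the clutter-favoring one, which is the behavior one would expect from the fusion layer.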
The architecture of the proposed eHMM is summarized in Fig. 1. It is composed of four main components: similarity matrix computation, relational clustering, adaptive training scheme, and decision level fusion. To test a new sequence, the outputs of the different models are aggregated into a single confidence value using Algorithm ??.
Application to landmine detection using ground-penetrating radar data
Data collections
The proposed eHMM was implemented and tested on GPR data collected with a NIITEK vehicle-mounted GPR system [33] (see Fig. 4). This system collects 51 channels of data. Adjacent channels are spaced approximately five centimeters apart in the cross-track direction, and sequences (or scans) are taken at approximately 1-cm down-track intervals. Each A-scan, that is, the measured waveform that is collected in one channel at one down-track position, contains 416 time samples at which the GPR signal return is recorded. We often refer to the time index as depth although, since the radar wave is traveling through different media, this index does not represent a uniform sampling of depth. Thus, we model an entire collection of input data as a three-dimensional matrix of sample values, S(z,x,y), z=1,⋯,416; x=1,⋯,51; y=1,⋯,N _{ S }, where N _{ S } is the total number of collected scans, and the indices z, x, and y represent depth, cross-track position, and down-track position, respectively. A collection of scans, forming a volume of data, is illustrated in Fig. 5.
Figure 6 displays several B-scans (sequences of A-scans) both down-track (formed from a time sequence of A-scans from a single sensor channel) and cross-track (formed from each channel’s response in a single sample). The surveyed object position is highlighted in each figure. The objects scanned are (a) a high-metal content anti-tank mine, (b) a low-metal anti-personnel mine, and (c) a wood block.
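In array terms, A-scans and B-scans are simply slices of the data volume S(z,x,y). The sketch below assumes the volume is stored as a NumPy array indexed (depth, channel, scan) with zero-based indices; the number of scans and the selected positions are arbitrary.

```python
import numpy as np

n_depth, n_channels, n_scans = 416, 51, 200        # n_scans (N_S) is arbitrary here
S = np.zeros((n_depth, n_channels, n_scans))        # S[z, x, y], zero-based

channel, scan = 25, 100                             # hypothetical position of interest
a_scan = S[:, channel, scan]                        # one waveform: 416 time samples
downtrack_bscan = S[:, channel, :]                  # depth vs. down-track, fixed channel
crosstrack_bscan = S[:, :, scan]                    # depth vs. cross-track, fixed scan
```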
Raw GPR data need to be preprocessed and prescreened. Preprocessing includes ground-level alignment and signal and noise background removal. Prescreening is needed to focus attention and identify regions with subsurface anomalies. For this step, we use the adaptive least mean squares (LMS) prescreener [34]. The LMS prescreener flags locations of interest using a computationally inexpensive algorithm so that more advanced algorithms can be applied only to the small subsets of data that it flags.
In our experiments, the data sets comprise a variety of mine and background signatures. In particular, we use data collected from outdoor test lanes at three different locations. The first two locations, site 1 and site 2, were temperate regions with significant rainfall, whereas the third collection, site 3, was a desert region. The lanes are simulated roads with known mine locations. Multiple data collections were performed at each site on different dates. The statistics of these data sets are reported in Table 1. Data cubes of size (15 scans, 7 channels, 416 depths) were extracted from each scan position flagged by the prescreener and are presented to the classifier to discriminate between mines and false alarms.
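Extracting such a cube around a flagged position can be sketched as follows. The array layout (depth, channel, scan), the centering of the window on the alarm, and the clamping of the window at the edges of the lane are assumptions made for illustration; the text does not describe how alarms near the borders are handled.

```python
import numpy as np

def extract_cube(volume, channel, scan, n_channels=7, n_scans=15):
    """Return a (depth, n_channels, n_scans) window around an alarm position.

    The window is centered on (channel, scan) when possible and shifted
    inward when the alarm is too close to a border (assumed behavior).
    """
    half_c, half_s = n_channels // 2, n_scans // 2
    x0 = min(max(channel - half_c, 0), volume.shape[1] - n_channels)
    y0 = min(max(scan - half_s, 0), volume.shape[2] - n_scans)
    return volume[:, x0:x0 + n_channels, y0:y0 + n_scans]

rng = np.random.default_rng(0)
volume = rng.standard_normal((416, 51, 100))        # toy data volume
cube = extract_cube(volume, channel=25, scan=50)    # alarm in the interior
edge_cube = extract_cube(volume, channel=0, scan=99)  # alarm at a corner
```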
Feature extraction
The goal of the feature extraction step is to transform original GPR data into a sequence of observation vectors. We use two types of features that have been proposed and used independently. Each feature represents a different interpretation of the raw data and aims at providing a good discrimination between mine and clutter signatures. These features are outlined in the following subsections.
5.2.1 EHD features
This feature is based on the edge histogram descriptor (EHD) [9] and characterizes edges in the spatial domain. The EHD captures the salient properties of the 3D alarms in a compact and translation-invariant representation. It extracts edge histograms capturing the frequency of occurrence of edge orientations in the data associated with a ground position. Simple edge detector operators are used to identify edges and group them into five categories: vertical, horizontal, diagonal, anti-diagonal, and isotropic (non-edge). Each B-scan position is then represented by a five-dimensional observation vector. Each dimension of this vector represents the percentage of pixels (in a small interval along the depth) that belong to one of the five edge categories.
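The idea of a five-bin edge histogram per B-scan position can be sketched as follows. This is not the EHD of [9]: the 3×3 Sobel-style operators, the threshold, the naming of the two diagonal directions, and the use of the full depth extent of the input (rather than a small depth interval) are all assumptions chosen to keep the example short.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Assumed edge operators; category indices 0..3, with 4 reserved for non-edge.
KERNELS = np.array([
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],      # 0: vertical edge
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],      # 1: horizontal edge
    [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]],      # 2: diagonal edge
    [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]],      # 3: anti-diagonal edge
], dtype=float)

def edge_histograms(bscan, threshold):
    """Return one 5-bin edge histogram per down-track column of a B-scan.

    bscan: (depth, down-track) array. Output shape: (down-track - 2, 5).
    """
    windows = sliding_window_view(bscan, (3, 3))                  # (H-2, W-2, 3, 3)
    responses = np.abs(np.einsum("ijkl,nkl->nij", windows, KERNELS))
    category = responses.argmax(axis=0)                            # strongest operator
    category[responses.max(axis=0) < threshold] = 4                # weak response: non-edge
    # Fraction of pixels of each category along the depth axis, per column.
    return np.stack([(category == c).mean(axis=0) for c in range(5)], axis=1)

# Toy B-scan (assumption): a single horizontal edge halfway down.
bscan = np.zeros((20, 10))
bscan[10:] = 1.0
hist = edge_histograms(bscan, threshold=1.0)
```

Each row of `hist` sums to one and plays the role of one five-dimensional observation vector in the sequence.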
5.2.2 Gabor features
Gabor features characterize edges in the frequency domain at multiple scales and orientations and are based on Gabor wavelets [7]. This feature is extracted by expanding the signature’s B-scan (depth vs. down-track) using a bank of scale- and orientation-selective Gabor filters. Expanding a signal using Gabor filters provides a localized frequency description. In our experiments, we use a bank of filters tuned to the combination of three scales and four orientations. Each observation is then represented by a 12-dimensional feature vector.
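A bank of three scales and four orientations yielding a 12-dimensional vector per down-track position can be sketched as follows. The wavelengths, the Gaussian width, the kernel size, and the averaging of the response magnitude along depth are assumptions made for this example; the filter parameters used in the experiments are not given in this section.

```python
import numpy as np

def gabor_kernel(half, wavelength, theta, sigma):
    """Complex Gabor kernel of size (2*half + 1) squared."""
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return envelope * np.exp(2j * np.pi * x_rot / wavelength)

def gabor_features(bscan, wavelengths=(4.0, 8.0, 16.0), n_orientations=4):
    """Return a (down-track, 12) array: mean response magnitude along depth."""
    h, w = bscan.shape
    features = []
    for wavelength in wavelengths:                   # scales
        sigma = wavelength / 2.0                     # assumed envelope width
        half = int(3 * sigma)
        for i in range(n_orientations):              # orientations
            kernel = gabor_kernel(half, wavelength, i * np.pi / n_orientations, sigma)
            # Zero-padded FFT product gives the full linear convolution.
            fh, fw = h + 2 * half, w + 2 * half
            full = np.fft.ifft2(np.fft.fft2(bscan, s=(fh, fw)) *
                                np.fft.fft2(kernel, s=(fh, fw)))
            same = full[half:half + h, half:half + w]   # crop to the input size
            features.append(np.abs(same).mean(axis=0))
    return np.stack(features, axis=1)

# Toy B-scan (assumption): a sinusoid of wavelength 8 along the down-track axis.
cols = np.arange(64)
bscan = np.tile(np.cos(2 * np.pi * cols / 8.0), (64, 1))
feats = gabor_features(bscan)                        # shape (64, 12)
```

In this toy case, the filter tuned to wavelength 8 and orientation 0 (column 4) responds far more strongly than the filter at the same scale rotated by 90 degrees (column 6), which illustrates the scale and orientation selectivity.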
Ensemble HMM implementation and results
In all experiments reported in this paper, we use a six-fold cross-validation for each data collection \(\mathfrak {D_{l}}\), l∈{1,2,3}. For each fold, a subset of the data (\(\mathfrak {D_{l}}_{\textit {Trn}}\)) is used for training and the remaining data (\(\mathfrak {D_{l}}_{\textit {Tst}}\)) is used for testing. \(\mathfrak {O_{l}}_{\textit {Trn}}^{Feat}\) denotes the set of observation sequences extracted from dataset \(\mathfrak {D_{l}}\), using one of the feature extraction methods, “Feat” (EHD or Gabor).
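A six-fold split can be sketched as follows. The text does not say how alarms are assigned to folds (for example, at random or by lane), so the random permutation used here, the seed, and the function name are assumptions.

```python
import numpy as np

def cross_validation_splits(n_samples, n_folds=6, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(cross_validation_splits(20))   # 20 toy alarms
```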
The first step of the eHMM is the similarity matrix computation. This step requires fitting an individual HMM to each sequence in the training data \(\mathfrak {O_{l}}_{\textit {Trn}}^{Feat}\). Figure 7 shows the log-likelihood and path mismatch penalty matrices for a training collection that has 521 mines and 1471 clutter signatures (first training fold of \(\mathfrak {D_{1}}\) using EHD features, \(\mathfrak {O_{1}}_{Trn1}^{EHD}\)). In these figures, the indices are rearranged so that the first entries correspond to the mine signatures and the latter ones correspond to non-mine signatures. As can be seen, the matrices are composed mainly of four blocks. The diagonal blocks correspond to testing mine signatures in mine models and non-mine signatures in non-mine models, and the off-diagonal blocks correspond to testing mine signatures in non-mine models and non-mine signatures in mine models. In these figures, dark pixels correspond to small values of the log-likelihood or path mismatch penalty, and bright pixels correspond to larger entries of the corresponding matrices. Note that in the log-likelihood matrix in Fig. 7 a, the diagonal blocks are brighter than the off-diagonal blocks. This means that signatures from the same class are more similar to each other than to signatures from different classes. Similarly, in the path mismatch penalty matrix of Fig. 7 b, the diagonal blocks are darker than the off-diagonal blocks. This means that when different mines are tested with each other’s models, the paths are similar. The above observations are expected, as alarms from the same class should be more similar to each other than alarms from different classes. However, they can be used to validate our similarity (and penalty) measures in the log-likelihood space. A more important observation is that within each diagonal block, sub-blocks can be identified. This is an indication of the existence of different clusters within the mines (and the clutter) themselves.
In the second step, the similarity matrix is transformed into a distance matrix, D, using (10). The hierarchical clustering algorithm [18] is then applied to D, with a fixed number of clusters K=10, to identify subcategories within the training data. For both the discrete and continuous versions, using any of the features and datasets, the eHMM clustering step successfully groups similar alarms into clusters. For instance, in Fig. 9, we show the hierarchical clustering results of the first cross-validation fold of the eCHMM using the EHD features on dataset \(\mathfrak {D_{1}}\). As can be seen in Fig. 9 a, we have a group of clutter-dominated clusters (in brown) and a second group of clusters dominated by mines (in blue). In Fig. 8, we show sample signatures that belong to clusters 1, 6, and 10. As can be seen from Figs. 8 and 9 a, cluster 1 has only clutter, and clusters 6 and 10 are composed exclusively of mine alarms. The mines that belong to cluster 6 typically have strong signatures. As shown in Fig. 9 b, c, they are typically mines with high metal content that are buried at shallow depths. The mines that belong to cluster 10 have weak GPR signatures. As shown in Fig. 9 b, c, they are typically either low-metal mines or mines buried at greater depths.
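Partitioning a precomputed distance matrix into K groups can be sketched with a naive agglomerative procedure. The linkage criterion is not stated in this section, so average linkage is assumed here; the distance matrix below is built from toy one-dimensional points rather than from log-likelihoods, and the function name is hypothetical.

```python
import numpy as np

def agglomerative_clusters(D, n_clusters):
    """Average-linkage agglomerative clustering on a symmetric distance matrix D.

    Starts with one cluster per item and repeatedly merges the two closest
    clusters until n_clusters remain. Returns an integer label per item.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()   # average linkage
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(D), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Toy distances (assumption): two well-separated groups of points on a line.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(points[:, None] - points[None, :])
labels = agglomerative_clusters(D, n_clusters=2)
```

This brute-force version is cubic in the number of sequences and is meant only to make the step concrete; a library implementation would be preferable for a training set with thousands of alarms.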
Additional details of the clusters’ contents per mine type and per burial depth are shown in Fig. 9 b, c. To summarize, the training data includes four homogeneous clusters (Clusters 6, 7, and 10 contain only mines and cluster 1 has only clutter). The remaining clusters (2, 3, 4, 5, 8, and 9) are mixed. Therefore, using the notation of Algorithm ??, we define our eHMM as:
In Table 2, we report the means and the weights of the components of each state of the BW-trained eCHMM model for cluster 6, \(\lambda _{6}^{(M)}\), as well as its transition probability matrix. Cluster 6 contains “typical” mines that have strong-edge and near-perfect hyperbolic shape signatures with a succession of states s _{1}, s _{2}, and s _{3}. Recall that s _{1}, s _{2}, and s _{3} correspond respectively to the rising (Dg), flat (Hz), and falling (Ad) edges within the mine signature. Therefore, all the components of s _{1} (resp. s _{3}) have their diagonal edge higher (resp. lower) than the anti-diagonal one. Similarly, components of s _{2} have a higher horizontal edge and comparable diagonal and anti-diagonal edges. As can be seen in the transition matrix of Table 2, the probability of staying in s _{1} (resp. s _{2}) is approximately three times (resp. two times) the probability of moving to s _{2} (resp. s _{3}).
Table 3 shows the BW-trained eCHMM model for cluster 10, \(\lambda _{10}^{(M)}\). Recall that cluster 10 contains only mine signatures that have a low metal content and/or are buried at 4" or deeper, as can be seen in Fig. 9 c. Therefore, the alarms in cluster 10 are expected to have weak signatures and weak edge features. This could explain the large non-edge component in most of the state component means of \(\lambda _{10}^{(M)}\) reported in Table 3, compared to the non-edge components of the state representatives of \(\lambda _{6}^{(M)}\). Nevertheless, the state representatives still characterize the hyperbolic shape of a typical mine signature, i.e., the succession of Dg-Hz-Ad states. For instance, all s _{1} component means have their diagonal D dimension larger than the anti-diagonal A dimension. For the transition matrix, we notice that \(\lambda _{10}^{(M)}\) is more stationary in s _{2}, with a self-transition probability of 0.89, than \(\lambda _{6}^{(M)}\). This means that, on average, sequences belonging to cluster 10 have a large number of observations with flat edges and fewer observations with strong diagonal or anti-diagonal edges.
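As a rough reading of these self-transition probabilities, recall that in a standard first-order HMM the number of consecutive observations spent in state s _{i} is geometrically distributed. Writing a _{ii} for the self-transition probability and d _{i} for the state duration (notation introduced here only for this illustration), the expected duration in the flat state of \(\lambda _{10}^{(M)}\) is

$$ E[d_{i}] = \frac{1}{1-a_{ii}}, \qquad E[d_{2}] = \frac{1}{1-0.89} \approx 9.1 \text{ observations}. $$

For comparison, if we assume that s _{2} of \(\lambda _{6}^{(M)}\) can only be kept or left for s _{3}, the two-to-one ratio reported for cluster 6 gives a _{22}≈2/3 and an expected duration of about 3 observations. Under these assumptions, sequences in cluster 10 remain in the flat state roughly three times longer.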
Figure 10 shows the scatter plot of the confidences assigned by \(\lambda _{6}^{(M)}\) and \(\lambda _{10}^{(M)}\) to all the training data. In this figure, we display clutter and mine signatures that belong to each cluster using different symbols and colors. Even though the two models are dominated by mine signatures, we see that not all confidence values are highly correlated. On the one hand, some strong mine signatures, particularly those belonging to cluster 6, have high log-likelihoods in model \(\lambda _{6}^{(M)}\) and lower log-likelihoods in model \(\lambda _{10}^{(M)}\) (lower right side of the scatter plot, region R1). This can be attributed to the fact that cluster 6 contains mainly strong mines and is more likely to yield a high log-likelihood when testing a strong mine signature. On the other hand, in region R2, the performance of \(\lambda _{10}^{(M)}\) is better, as it gives higher likelihood values to the “weak” mines in that region, particularly those belonging to cluster 10. This result is expected because cluster 10 contains weak mine signatures.
The main conclusion that we can draw from the above example is that \(\lambda _{6}^{(M)}\) and \(\lambda _{10}^{(M)}\) are very different and characterize two distinct subsets of the training data. The standard HMM approach would combine all alarms to learn a single model for mines (weak and strong) and a single model for clutter.
In the final step, the eHMM mixture is combined using a single-layer ANN. The ANN parameters are trained to fit the responses of the eHMM mixture models to the training data labels.
eHMM vs. baseline HMM results
In this section, we compare the performance of the proposed eHMM to the baseline HMM [5]. For the eHMM, we show the results using the ANN fusion and the hierarchical agglomerative clustering with K=20. In Fig. 11 a, b, we show the ROCs generated by the discrete versions of the eHMM and the baseline HMM on dataset \(\mathfrak {D_{1}}\), using EHD and Gabor features. Similarly, in Fig. 12 a, b, we report the ROCs generated by the continuous versions, i.e., the eCHMM and the baseline CHMM, on dataset \(\mathfrak {D_{1}}\). As it can be seen, in all the ROCs of Figs. 11 and 12, at a given false alarm rate (FAR), the eHMM has a better probability of detecting targets. For instance, in Fig. 11 a, at a FAR of 10 %, the eDHMM using EHD features successfully identifies 94 % of the mines while the baseline DHMM identifies only 87 % of the targets. At the same FAR of 10 %, the ROCs of Fig. 12 a show that the eCHMM successfully identifies 95 % of the targets while the baseline CHMM probability of detection is 85 %.
The results for all three datasets are summarized in terms of the Area Under ROC Curve (AUC) and are reported in Table 4. As it can be seen, in all experiments, the eHMM outperforms the baseline HMM.
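The quantities read off the ROCs (probability of detection at a given FAR, and the AUC) can be computed from the fused confidences as sketched below. The sketch assumes that the FAR is expressed as the fraction of clutter alarms exceeding the threshold, as the percentages quoted above suggest, and it does not treat tied confidences specially; the function names and the six toy alarms are hypothetical.

```python
import numpy as np

def roc_curve(confidences, labels):
    """Return (far, pd) arrays obtained by sweeping the threshold downward."""
    order = np.argsort(-confidences)
    hits = labels[order] == 1
    pd = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    far = np.concatenate([[0.0], np.cumsum(~hits) / (~hits).sum()])
    return far, pd

def pd_at_far(far, pd, target_far):
    """Best probability of detection with a false alarm rate <= target_far."""
    return pd[far <= target_far].max()

def auc(far, pd):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(far) * (pd[1:] + pd[:-1]) / 2.0))

# Toy confidences (assumption): three mines (label 1) and three clutter alarms (label 0).
confidences = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0])
far, pd = roc_curve(confidences, labels)
area = auc(far, pd)
```

For these six toy alarms, eight of the nine mine/clutter pairs are ranked correctly, so the area equals 8/9.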
Conclusions
In this work, we have proposed a novel ensemble HMM classification method that is based on clustering sequences in the log-likelihood space. The eHMM uses multiple HMMs and fuses them for final decision making. We hypothesized that the data are generated by multiple models. These different models reflect the fact that samples from the same class can have different characteristics, resulting in large intra-class variability.
The eHMM, in its discrete and continuous versions, was implemented and evaluated using large collections of landmine GPR data. We examined the intermediate steps of the eHMM and compared its performance to the baseline HMM. Results on three GPR data collections show that the proposed method can identify meaningful and coherent HMM mixture models that describe different properties of the data. Each individual HMM characterizes a group of data that share common attributes. The experiments show that the intermediate results of the proposed eHMM are in line with the expected behavior. The results also indicate that, for both the continuous and discrete versions, the proposed method outperforms the baseline HMM that uses one model for each class in the data.
Endnote
^{1} The details of the landmine detection application using GPR signatures will be presented in section 5.
References
 1
Landmine Monitor Report, (2013). http://www.themonitor.org/.
 2
TR Witten, in SPIE Conf Detection and Remediation Technologies for Mines and Minelike Targets III. Present state of the art in ground-penetrating radars for mine detection (Orlando, FL, 1998), pp. 576–586.
 3
PD Gader, H Frigui, BN Nelson, G Vaillette, JM Keller, in SPIE Conf Detection and Remediation Technologies for Mines and Minelike Targets IV. New results in fuzzy set based detection of landmines with GPR (Orlando, FL, 1999), pp. 1075–1084.
 4
PD Gader, B Nelson, H Frigui, G Vaillette, JM Keller, Fuzzy logic detection of landmines with ground penetrating radar. Signal Process. Special Issue Fuzzy Logic Signal Process. 80, 1069–1084 (2000).
 5
PD Gader, M Mystkowski, Y Zhao, Landmine detection with ground penetrating radar using hidden Markov models. IEEE Trans. Geosci. Remote Sensing. 39, 1231–1244 (2001).
 6
H Frigui, DKC Ho, PD Gader, Real-time landmine detection with ground-penetrating radar using discriminative and adaptive hidden Markov models. EURASIP J. Appl. Signal Process. 12, 1867–1885 (2005).
 7
H Frigui, O Missaoui, PD Gader, in SPIE Conf. Detection and Remediation Technologies for Mines and Minelike Targets XII. Landmine detection using discrete hidden Markov models with Gabor features (Louisville, KY, USA, 2007).
 8
H Frigui, PD Gader, S Kotturu, in SPIE Conf. Detection and Remediation Technologies for Mines and Minelike Targets. Detection and discrimination of landmines in ground penetrating radar using an eigenmine and fuzzy membership function approach, (2004). doi:10.1109/TFUZZ.2008.2005249.
 9
H Frigui, PD Gader, in Proceedings of the IEEE International Conference on Fuzzy Systems. Detection and discrimination of land mines based on edge histogram descriptors and fuzzy k-nearest neighbors (Vancouver, BC, Canada, 2006).
 10
A Karem, A Fadeev, H Frigui, P Gader, in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 7664. Comparison of different classification algorithms for landmine detection using GPR, (2010), p. 2. doi:10.1117/12.852257.
 11
PA Torrione, KD Morton, R Sakaguchi, LM Collins, Histograms of oriented gradients for landmine detection in ground-penetrating radar data. Geosci. Remote Sens. IEEE Trans. 52(3), 1539–1550 (2014). doi:10.1109/TGRS.2013.2252016.
 12
O Missaoui, H Frigui, P Gader, in Geoscience and Remote Sensing Symposium (IGARSS), 2010 IEEE International. Model level fusion of edge histogram descriptors and Gabor wavelets for landmine detection with ground penetrating radar, (2010), pp. 3378–3381. doi:10.1109/IGARSS.2010.5650350.
 13
O Missaoui, H Frigui, P Gader, in Machine Learning and Applications, 2009. ICMLA ’09. International Conference On. Discriminative multistream discrete hidden Markov models, (2009), pp. 178–183. doi:10.1109/ICMLA.2009.121.
 14
O Missaoui, H Frigui, P Gader, Multistream continuous hidden Markov models with application to landmine detection. EURASIP J. Adv. Signal Process. 1 (2013). doi:10.1186/16876180201340.
 15
CR Ratto, KD Morton, LM Collins, PA Torrione, in Geoscience and Remote Sensing Symposium (IGARSS), 2011 IEEE International. A hidden Markov context model for GPR-based landmine detection incorporating stick-breaking priors, (2011), pp. 874–877. doi:10.1109/IGARSS.2011.6049270.
 16
LR Rabiner, in Proceedings of the IEEE. A tutorial on hidden Markov models and selected applications in speech recognition (San Francisco, CA, USA, 1989), pp. 257–286.
 17
S Theodoridis, K Koutroumbas, Pattern Recognition, Fourth Edition (Academic Press, Inc, Orlando, FL, USA, 2009).
 18
R Duda, P Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, New York, 1973).
 19
LE Baum, T Petrie, Statistical Inference for Probability Functions of Finite State Markov Chains. Ann. Math. Stat. 37, 1554–1563 (1966).
 20
J Paisley, L Carin, Hidden Markov models with stick-breaking priors. Signal Process. IEEE Trans. 57(10), 3905–3917 (2009). doi:10.1109/TSP.2009.2024987.
 21
LE Baum, T Petrie, Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37(6), 1554–1563 (1966). doi:10.2307/2238772.
 22
JH Ward, Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58(301), 236–244 (1963).
 23
JH Holland, Adaptation in Natural and Artificial Systems (MIT Press, Cambridge, MA, USA, 1992).
 24
S Kirkpatrick, CD Gelatt, MP Vecchi, Optimization by Simulated Annealing. Science. 220(4598), 671–680 (1983). doi:10.1126/science.220.4598.671.
 25
BH Juang, W Chou, CH Lee, Minimum Classification Error Rate Methods for Speech Recognition. Trans. Speech Audio Process. 5(3), 257–265 (1997).
 26
DJC MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, New York, 2003).
 27
AK Jain, RC Dubes, Algorithms for Clustering Data (Prentice Hall, Upper Saddle River, NJ, USA, 1988).
 28
JC Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Kluwer Academic Publishers, Norwell, MA, USA, 1981).
 29
G Fumera, F Roli, A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 942–956 (2005). doi:10.1109/TPAMI.2005.109.
 30
DS Lee, SN Srihari, in ICDAR ’95: Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1). A theory of classifier combination: the neural network approach (IEEE Computer Society, Washington, DC, USA, 1995), p. 42.
 31
MI Jordan, RA Jacobs, in NIPS. Hierarchies of adaptive experts (Denver, 1991), pp. 985–992.
 32
DE Rumelhart, GE Hinton, RJ Williams, Learning internal representations by error propagation, 318–362 (1986). ISBN:026268053X.
 33
KJ Hintz, in Proceedings of the SPIE Conference on Detection and Remediation Technologies for Mines and Minelike Targets IX. SNR improvements in Niitek ground penetrating radar (Orlando, FL, USA, 2004).
 34
PA Torrione, CS Throckmorton, LM Collins, Performance of an adaptive feature-based processor for a wideband ground penetrating radar system. IEEE Trans. Aerosp. Electron. Syst. 42(2), 644–658 (2006).
Acknowledgements
This work was supported in part by U.S. Army Research Office Grants Number W911NF1310066 and W911NF14 10589. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, or the U.S. Government.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Hamdi, A., Frigui, H. Ensemble hidden Markov models with application to landmine detection. EURASIP J. Adv. Signal Process. 2015, 75 (2015). https://doi.org/10.1186/s1363401502608
Keywords
 Hidden Markov models
 Mixture models
 Landmine detection
 Groundpenetrating radar
 Clustering