 Research
 Open Access
Ensemble hidden Markov models with application to landmine detection
 Anis Hamdi^{1} and
 Hichem Frigui^{1}
https://doi.org/10.1186/s13634-015-0260-8
© Hamdi and Frigui. 2015
 Received: 8 April 2015
 Accepted: 4 August 2015
 Published: 19 August 2015
Abstract
We introduce an ensemble learning method for temporal data that uses a mixture of hidden Markov models (HMM). We hypothesize that the data are generated by K models, each of which reflects a particular trend in the data. The proposed approach, called ensemble HMM (eHMM), is based on clustering within the log-likelihood space and has two main steps. First, one HMM is fit to each of the N individual training sequences. For each fitted model, we evaluate the log-likelihood of each sequence. This results in an N-by-N log-likelihood distance matrix that is partitioned into K groups using a relational clustering algorithm. In the second step, we learn the parameters of one HMM per cluster. We propose using and optimizing various training approaches for the different K groups depending on their size and homogeneity. In particular, we investigate the maximum likelihood (ML), the minimum classification error (MCE), and the variational Bayesian (VB) training approaches. Finally, to test a new sequence, its likelihood is computed in all the models and a final confidence value is assigned by combining the models’ outputs using an artificial neural network. We propose both discrete and continuous versions of the eHMM.
Our approach was evaluated on a real-world application for landmine detection using ground-penetrating radar (GPR). Results show that both the continuous and discrete eHMM can identify meaningful and coherent HMM mixture components that describe different properties of the data. Each HMM mixture component models a group of data that share common attributes. These attributes are reflected in the mixture model’s parameters. The results indicate that the proposed method outperforms the baseline HMM that uses one model for each class in the data.
Keywords
 Hidden Markov models
 Mixture models
 Landmine detection
 Ground-penetrating radar
 Clustering
1 Introduction
Detection and removal of buried landmines is a worldwide humanitarian and military problem. The latest statistics [1] show that in 2012, a total of 3618 casualties from mines were recorded in 62 countries; the vast majority (78 %) of the casualties were civilians. Landmine detection has therefore attracted many researchers in recent years. One challenge lies in plastic or low-metal mines that are difficult to detect with traditional metal detectors. A variety of sensors has been proposed or is under investigation for landmine detection. Ground-penetrating radar (GPR) offers the promise of detecting landmines with little or no metal content. Unfortunately, landmine detection via GPR has proven to be a difficult problem [2, 3]. Although systems can achieve high detection rates, they have done so at the expense of high false alarm rates. The key challenge to mine detection technology lies in achieving a high rate of mine detection while maintaining a low false alarm rate. The performance of a mine detection system is therefore commonly measured by a receiver operating characteristic (ROC) curve that plots the rate of true detection versus the rate of false alarm.
To improve the overall ROC of the landmine detection system, several algorithms have been introduced in the last decade. These algorithms use methods such as fuzzy logic [4], hidden Markov models [5–7], nearest neighbor classifiers [8, 9], support vector machines [10], or random forest [11] to assign a confidence that a mine is present at a point.
In [5, 6], hidden Markov modeling was proposed for detecting both metal and non-metal mine types using data collected by a moving vehicle-mounted GPR system. These initial applications proved that HMM techniques are feasible and effective for landmine detection. The initial work relied on simple gradient edge features. Subsequent work used an edge histogram descriptor (EHD) approach to extract features from the original GPR signatures. The baseline HMM classifier consists of two HMM models, one for mines and one for the background. The mine (background) model captures the characteristics of the mine (background) signatures. The model initialization and subsequent training are based on global averaging over the training data corresponding to each class.
Most subsequent published work on landmine detection using HMMs focused on feature-level fusion [12] and/or model-level fusion [13–15]. All of these methods still use a single model for each class. In this paper, we argue that a single model is not sufficient to capture the intra-class variability. In the context of landmine detection, variations within the class of mines may be caused by different mine types, burial depths, soil types, and moisture levels. Similarly, background signatures may exhibit large variations due to different soil conditions and data preprocessing techniques. To generalize the HMM approach, we identify the variations within each class in an unsupervised manner and use multiple models to account for the intra-class variations.
The remainder of this paper is organized as follows. Section 2 provides background material on hidden Markov models. Section 3 highlights the motivations for adopting multiple models in our approach. Section 4 outlines the eHMM architecture and describes its different components. Section 5 reports the experimental results of our eHMM approach on large GPR collections and compares them to those of the baseline HMM detector. Finally, conclusions are provided in Section 6.
2 Background
2.1 Hidden Markov models
The compact notation \(\lambda = (A, B, \pi)\) (1) is generally used to indicate the complete parameter set of the HMM model. In (1), \(A = [a_{ij}]\) is the state transition probability matrix, where \(a_{ij} = Pr(q_t = j \mid q_{t-1} = i)\) for \(i, j = 1, \ldots, N\); \(\pi = [\pi_i]\), where \(\pi_i = Pr(q_1 = s_i)\), are the initial state probabilities; and \(B = \{b_i(O_t)\}, \ i = 1, \ldots, N\), where \(b_i(O_t) = Pr(O_t \mid q_t = i)\) is the observation probability distribution in state i.
An HMM is called continuous if the observation probability density functions are continuous and discrete otherwise. In the case of the discrete HMM, the observation vectors are commonly quantized into a finite set of symbols, \(\{v_1, v_2, \ldots, v_M\}\), called the codebook. Each state is represented by a discrete probability density function and each symbol has an associated probability of occurring given that the system is in a given state. In other words, B becomes a simple set of fixed probabilities for each state. That is, \(b_i(O_t) = b_i(k) = Pr(v_k \mid q_t = i)\), where \(v_k\) is the nearest codebook symbol to \(O_t\).
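As a concrete illustration of this quantization step, the sketch below (NumPy, with a made-up codebook) maps each observation vector of a sequence to the index of its nearest codebook symbol:

```python
import numpy as np

def quantize(sequence, codebook):
    """Map each observation vector to the index of its nearest codebook symbol,
    so that b_i(O_t) reduces to a lookup b_i(k)."""
    # sequence: (T, d) array; codebook: (M, d) array of symbols v_1..v_M
    dists = np.linalg.norm(sequence[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)  # length-T array of symbol indices

# toy 2-D observations and a made-up codebook of M = 3 symbols
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
seq = np.array([[0.1, -0.1], [0.9, 1.2], [1.9, 0.1]])
print(quantize(seq, codebook))  # -> [0 1 2]
```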
Given the form of the hidden Markov model defined in (1), Rabiner [16] defines three key problems of interest that must be solved for the model to be useful in real-world applications: (i) the classification problem; (ii) the problem of finding an optimal state sequence; and (iii) the problem of estimating the model parameters.
The classification problem involves computing the probability of an observation sequence \(O = \{O_1, O_2, \ldots, O_T\}\) given a model λ, i.e., \(Pr(O \mid \lambda)\). This probability is computed efficiently using the forward–backward procedure [16].
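The forward pass of this procedure can be sketched in a few lines for a discrete HMM; the toy model parameters below are illustrative:

```python
import numpy as np

def log_likelihood(pi, A, B, obs):
    """Forward procedure for a discrete HMM: returns log Pr(O | lambda)."""
    # pi: (N,) initial probs; A: (N, N) transitions; B: (N, M) emissions;
    # obs: length-T list of symbol indices
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction over t
    return float(np.log(alpha.sum()))  # termination: sum over final states

# toy 2-state model with binary observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
print(log_likelihood(pi, A, B, [0, 1, 1]))  # approximately -1.924
```

For long sequences, a scaled or fully log-domain recursion is preferable to avoid underflow.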
In most applications, it often turns out that computing an optimal state sequence is more useful than \(Pr(O \mid \lambda)\). There are several possible optimality criteria. One that is particularly useful is to maximize \(Pr(O, Q \mid \lambda)\) over all possible state sequences Q. The Viterbi algorithm [17] is an efficient and formal technique for finding this optimal state sequence and its probability.
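A minimal log-domain Viterbi implementation, again on an illustrative toy model:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: the state sequence maximizing Pr(O, Q | lambda)."""
    T, N = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)  # scores[i, j]: best path ending i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]             # best final state
    for t in range(T - 1, 0, -1):            # backtrack through psi
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# toy 2-state model with binary observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(pi, A, B, [0, 1, 1]))  # -> [0, 1, 1]
```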
The third problem in building an HMM is the training problem: how does one estimate the parameters of the model? The problem is difficult because there are several levels of estimation required in an HMM. First, the states themselves must be estimated. Then, the model parameters λ=(A,B,π) need to be estimated. For the discrete HMM, first the codebook is determined, usually using the K-means algorithm [18] or another vector quantization algorithm. Then, the parameters (A,B,π) are estimated iteratively using the Baum-Welch learning algorithm [19].
2.2 Baseline HMM classifier for landmine detection
The baseline HMM classifier for GPRbased landmine detection was first introduced in [5]. It consists of two HMM models, one for mines and one for the background. Each model has four states and produces a probability value by backtracking through model states using the Viterbi algorithm [17]. The mine model, λ ^{ m }, is designed to capture the spatial distribution of the features. This model assumes that mine signatures have a hyperbolic shape comprised of a succession of rising, horizontal, and falling edges with variable duration in each state. The beginning and the end of the observation vectors correspond typically to a nonedge (or background) state. The background model, λ ^{ b }, is needed to capture the background and clutter characteristics. No prior information or assumptions are used for this model.
where \(\#\{s_t = 1, t = 1, \cdots, T\}\) corresponds to the number of observations assigned to the background state (state 1). \(T_{max}\) is defined experimentally based on the shortest mine signature. Equation (2) ensures that sequences with a large number of observations assigned to state 1 are considered non-mines.
2.3 Extensions to the baseline HMM for landmine detection
In an effort to improve performance and generalization, several extensions to the baseline HMM have been proposed. For instance, in [12, 13], the authors proposed the multi-stream HMM (MSHMM) that combines multiple sets of features. An optimal weight for each feature is learned in the training phase. In [14], maximum likelihood (ML) and minimum classification error (MCE) learning methods were derived for the MSHMM. In [15], HMMs with stick-breaking priors (SBHMM) [20] were employed to learn the number of HMM states in the baseline HMM landmine detector. This approach relies on a variational Bayesian learning technique in lieu of the standard Baum-Welch (BW) training.
3 Motivations
The baseline HMM represents each class by a single model learned from all the observations within that class. The goal is to generalize from all the training data in order to classify unseen observations. However, for complex classification problems with large intra-class variations, combining observations with different characteristics to learn one model might lead to excessive averaging and, thus, a loss of the discriminating characteristics of the observations.
Consequently, learning a set of models that reflect different characteristics of the observations might be more beneficial than using one global model for each class. This is typical in many classifiers such as the KNN [9], which uses different prototypes for each class, and the SVM [10], which uses multiple support vectors.
In this paper, we develop a new approach that replaces the two-model classifier with one that includes multiple models for each class. For instance, each group of signatures in Fig. 3 would be used to learn a different model. Our approach aims to capture the characteristics of the observations that would be lost under averaging in the two-model case.
We hypothesize that under realistic conditions, the data are generated by multiple models. The proposed approach, called ensemble HMM (eHMM), attempts to partition the training data within the log-likelihood space and identify multiple clusters in an unsupervised manner. Depending on each cluster’s homogeneity and size, an appropriate training scheme is applied to learn the corresponding HMM parameters. The resulting K HMMs are then aggregated through a decision-level fusion component to form a descriptive model for the data.
4 Ensemble HMM architecture
Let \(\mathbb {O}=\left \{O_{r},y_{r}\right \}_{r=1}^{R}\) be a set of R labeled sequences of length T, where \(O_{r}=\left \{O_{r}^{(1)},\cdots,O_{r}^{(T)}\right \}\) and \(y_r \in \{1, \cdots, C\}\) is the label (class) of sequence \(O_r\). First, we need to identify subgroups of observations that share common patterns. Ground truth information cannot be used for this task because it is insufficient and unreliable. For instance, a large, deeply buried mine can have a signature similar to that of a small, shallowly buried mine. Furthermore, the same mine buried at the same depth in soils with different properties may have different signatures. Thus, the partitioning needs to be done in an unsupervised way, i.e., regardless of the observations’ labels and the limited ground truth information. In our approach, we use unsupervised learning to cluster the set of all observations, \(\mathbb {O}\), into subgroups of “similar” observations. The first step in this approach is to define a measure of similarity between two observations.
4.1 Similarity between observations in the log-likelihood space
4.1.1 Fitting individual models to sequences
Initially, each sequence in the training data, O _{ r }, 1≤r≤R is used to learn an HMM model λ _{ r }. Even though using only one sequence of observations to learn an HMM might lead to overfitting, this technique is only an intermediate step that aims to capture the characteristics of each sequence. The produced HMM model is meant to give a maximal description of each sequence, and therefore, overfitting is not an issue in this context. In fact, it is desired that the model perfectly fits the observation sequence. In this case, the likelihood of each sequence with respect to its corresponding model is expected to be higher than those with respect to the remaining models.
Let \(\left \{\lambda _{r}^{(0)}\right \}_{r=1}^{R}\) be the set of initial models and let \(s_{n}^{(r)}, \ 1 \leq n \leq N\), be the representative of each state in \(\lambda _{r}^{(0)}\). Each model has N states. First, the model states can be assigned to the sequence observations either heuristically, using domain knowledge, or automatically by clustering the sequence observations into N clusters. In our approach, we use the latter and we define the states’ means and observations as the center and elements of each resulting cluster, respectively. Consequently, the transition matrix and the initial probabilities of \(\lambda _{r}^{(0)}\) are set according to the aforementioned associations. For the emission probabilities, the initialization differs whether we use the discrete or continuous HMM.
Then, the Baum-Welch algorithm [21] is used to adapt the model parameters to each given observation. Let \(\{\lambda _{r}\}_{r=1}^{R}\) be the set of trained individual models.
Next, we need to define a measure that evaluates the similarity between pairs of observation sequences. While similarity between static data observations is straightforward and well defined, defining a similarity between observation sequences is more of a challenge. Within the context of HMM modeling, we consider two observation sequences similar if: (i) they fit each other’s models; and (ii) they have similar Viterbi optimal paths [17].
4.1.2 Log-likelihood-based similarity
In (7), L can be computed using the forward–backward procedure described in Section 2.1. When the log-likelihood value is high, it is likely that model \(\lambda_j\) generated sequence \(O_i\). In this case, sequences \(O_i\) and \(O_j\) are expected to share salient features and are considered similar. On the other hand, when the likelihood term is low, it is unlikely that model \(\lambda_j\) generated sequence \(O_i\); in this case, \(O_i\) and \(O_j\) are considered dissimilar. For each observation sequence \(O_r\), 1≤r≤R, we compute its likelihood under each model \(\lambda_p\), \(Pr(O_r \mid \lambda_p)\), for 1≤p≤R. This results in an R×R log-likelihood matrix.
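The construction of this matrix can be sketched as follows; the log-likelihood values are hypothetical, and the symmetrization shown is one natural way to turn the R×R matrix into a pairwise similarity (the paper's exact Eq. (7) may differ):

```python
import numpy as np

# Hypothetical values for R = 3 sequences/models:
# loglik[r, p] = log Pr(O_r | lambda_p), each entry computed with the
# forward-backward procedure of Section 2.1
loglik = np.array([[-10.0, -25.0, -40.0],
                   [-24.0, -12.0, -38.0],
                   [-41.0, -39.0, -11.0]])

# symmetrized similarity: how well sequences i and j fit each other's models
sim = 0.5 * (loglik + loglik.T)
print(sim[0, 1])  # -> -24.5
```

Note the diagonal dominance: each sequence scores highest under its own model, as expected from the individual fitting step.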
4.1.3 Path-mismatch-based penalty
In (8), P(i,j) is the distance between \(Q^{(j|i)}\), the Viterbi optimal path of testing sequence \(O_i\) with model \(\lambda_j\), and \(Q^{(j|j)}\), the Viterbi optimal path of testing sequence \(O_j\) with model \(\lambda_j\). In (8), \(D_{Edit}\) is the “edit distance” [17], commonly used in string comparison. The edit distance between two strings, say p and q, is defined as the minimum number of single-character edit operations (deletions, insertions, and/or replacements) needed to convert p into q. The Viterbi path mismatch term is intended to ensure that similar sequences have few mismatches in their corresponding Viterbi optimal paths. Since the Viterbi path is already available when using the forward–backward procedure for the likelihood computation, the penalty term does not require significant additional computation.
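The edit distance over state paths is the standard dynamic-programming recurrence:

```python
def edit_distance(p, q):
    """Minimum number of single-character deletions, insertions, and
    replacements converting sequence p into sequence q."""
    m, n = len(p), len(q)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                            # delete all of p[:i]
    for j in range(n + 1):
        d[0][j] = j                            # insert all of q[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement (or match)
    return d[m][n]

# distance between two Viterbi state paths, e.g. Q(j|i) vs. Q(j|j)
print(edit_distance([1, 1, 2, 3, 3], [1, 2, 2, 3, 3]))  # -> 1
```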
In (9), the mixing factor α∈[0,1] is a trade-off parameter between the log-likelihood-based similarity and the Viterbi-path-mismatch-based dissimilarity. It is estimated experimentally by maximizing the intra-class similarity and minimizing the inter-class similarity across the training data. A larger value of α corresponds to a dominant log-likelihood-based similarity, where the need for the mismatch penalty is not significant. A smaller α corresponds to a more significant path mismatch penalty.
4.2 Pairwise distance-based clustering
The distance matrix, computed using (10), reflects the degree to which pairs of sequences are considered similar. The largest variation is expected to be between sequences from different classes. Other significant variations may exist within the same class, e.g., the groups of signatures shown in Fig. 3. Our goal is to identify the similar groups so that one model can be learned for each group. This task can be achieved using any relational clustering algorithm. In our work, we use the standard agglomerative hierarchical algorithm [18].
where n _{ k } and c _{ k } are the cardinality and the centroid of cluster C _{ k }, respectively. It has been shown in [17] that this approach merges the two clusters that lead to the smallest increase in the overall variance.
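A tiny bottom-up clustering of a relational distance matrix can be sketched as follows. For brevity this uses average linkage rather than the Ward criterion described above, and the distance matrix is a toy example:

```python
import numpy as np

def agglomerative(D, K):
    """Naive agglomerative clustering of a relational distance matrix D into
    K groups (average linkage; the paper merges by Ward's minimum-variance
    criterion instead)."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > K:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average inter-cluster distance
                d = np.mean([D[i][j] for i in clusters[a] for j in clusters[b]])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)   # merge the closest pair
    return clusters

# toy 4-sequence distance matrix with two obvious groups
D = np.array([[0, 1, 9, 9],
              [1, 0, 9, 9],
              [9, 9, 0, 2],
              [9, 9, 2, 0]])
print(agglomerative(D, 2))  # -> [[0, 1], [2, 3]]
```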
4.3 Ensemble HMM initialization and training
The previous clustering step results in K clusters, each comprised of potentially similar sequences. Each cluster is then used to learn an HMM, resulting in an ensemble of K HMMs. Let N _{ k } denote the number of sequences assigned to the same cluster k. Since our clustering step did not use class labels, clusters may include sequences from different classes. Let \(N_{k}^{(c)}\) be the number of sequences in cluster k that belong to class c, such that \(\sum _{c=1}^{C}{N_{k}^{(c)}}=N_{k}\). For instance, for the landmine example, if we let c=1 denote the class of mines and c=0 denote the class of clutter, \(N_{k}^{(1)}\) would be the number of mines assigned to cluster k.
The next step of our approach consists of learning a set of HMMs that reflect the diversity of the training data. Since a cluster contains a set of similar sequences, and each cluster may include observations from different classes, we learn one HMM model \(\left \{\lambda _{k}^{(c)}\right \}\) for each set of sequences assigned to class c within cluster k. Let \(\mathbb {O}_{k}^{(c)}=\left \{O_{r}^{(c)},y_{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be the set of sequences partitioned into cluster k that belong to class c and let \(\left \{\lambda _{r}^{(c)}\right \}_{r=1}^{N_{k}^{(c)}}\) be their corresponding individual HMM models, c∈{1,⋯,C}.

Clusters dominated by sequences from only one class: In this case, we learn only one model for this cluster. The sequences within this cluster are presumably similar and belong to the same ground truth class, denoted \(C_i\). We assume that this cluster is representative of that particular dominating class. The class conditional posterior probability is expected to be unimodal and to peak around the maximum likelihood estimate (MLE) of the parameters. Thus, maximum likelihood estimation would result in an HMM that best fits this particular class. For these reasons, we use the standard Baum-Welch re-estimation procedure [21]. Let K _{1} be the number of homogeneous clusters that fit into this category and let \(\left \{\lambda _{i}^{(C_{i})},i=1,\cdots,K_{1}\right \}\) denote the set of BW-trained models.

Clusters with a mixture of observations belonging to different classes: In this case, it is expected that the posterior distribution of the classes is multimodal. Thus, we need to learn one model for each class represented in this cluster. The MLE approach is not adequate, and more discriminative learning techniques such as genetic algorithms [23] or simulated annealing optimization [24] are needed to address the multimodality. In our work, we build a model for each class within the cluster. We focus on finding the class boundaries within the posteriors rather than trying to approximate a joint posterior probability. Thus, the models’ parameters are jointly optimized to minimize the overall misclassification error using a discriminative learning approach [25]. Let K _{2} be the number of mixed clusters that fit into this category and let \(\left \{\lambda _{j}^{(c)},j=1,\cdots,K_{2},c=1, \cdots,C\right \}\) be the set of MCEtrained models.

Clusters containing a small number of sequences: The MLE and MCE learning approaches need a large number of data points to give robust estimates of the model parameters. Thus, when a cluster has few samples, the above approaches may not be reliable. Ignoring these clusters is not a good option as they may contain information about sequences with distinctive characteristics. The Bayesian training framework [26], on the other hand, is suitable to learn model parameters using a small number of training sequences. Specifically, we select only the dominating class for this cluster and learn a single model using a variational Bayesian approach [26] to approximate the class conditional posterior distribution. Let K _{3} be the number of small clusters that fit into this category and let \(\left \{\lambda _{k}^{(C_{k})}, k=1,\cdots,K_{3}\right \}\) denote the set of Bayesiantrained models.
To summarize, for each homogeneous cluster i, we define one model \(\lambda _{i}^{(C_{i})}\), i=1,⋯,K _{1}, for the dominating class C _{ i }. For mixed clusters, we define C models per cluster: \(\lambda _{j}^{(c)}\), c=1,⋯,C, j=1,⋯,K _{2}. For each small cluster, we define one model \(\lambda _{k}^{(C_{k})}\) for the dominating class C _{ k }. The ensemble HMM mixture is defined as \(\left \{\lambda _{k}^{(c)}\right \}\), where k∈{1,⋯,K}, with c=C _{ k } if cluster k is dominated by sequences labeled with class C _{ k }, and c∈{1,⋯,C} if cluster k is a mixed cluster.
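The routing of a cluster to a training scheme reduces to a simple rule on its per-class counts. In this sketch the `min_size` and `purity` thresholds are illustrative values, not thresholds stated in the paper:

```python
def training_scheme(counts, min_size=10, purity=0.9):
    """Route a cluster to a training scheme from its per-class sequence counts.
    min_size and purity are hypothetical thresholds for illustration."""
    n = sum(counts.values())
    if n < min_size:
        return "VB"   # small cluster: variational Bayes, dominating class only
    if max(counts.values()) / n >= purity:
        return "ML"   # homogeneous cluster: Baum-Welch (maximum likelihood)
    return "MCE"      # mixed cluster: discriminative MCE, one model per class

print(training_scheme({"mine": 48, "clutter": 2}))   # -> ML
print(training_scheme({"mine": 30, "clutter": 25}))  # -> MCE
print(training_scheme({"mine": 4, "clutter": 1}))    # -> VB
```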

Discrete HMM (DHMM): the state representatives and the codebook of model \(\lambda _{k}^{(c)}\) are obtained by partitioning and quantizing the observations \(\mathbb {O}_{k}^{(c)}\). First, sequences from cluster k that belong to class c, \(O_{r}^{(c)}\), are “unrolled” to form a vector of observations U ^{(k,c)} of length \(N_{k}^{(c)}T\). The state representatives, \(s_{n}^{(k,c)}\), are obtained by clustering U ^{(k,c)} into N clusters and taking the centroid of each cluster as the state representative. Similarly, the codebook \(\textbf {V}^{(k,c)}=\left [v_{1}^{(k,c)},\cdots,v_{M}^{(k,c)}\right ]\) is obtained by clustering U ^{(k,c)} into M clusters. For each symbol \(v_{m}^{(k,c)}\), the membership in each state \(s_{n}^{(k,c)}\) is computed using$$ b_{n}^{(k,c)}(m) = \frac{1/\left\|v_{m}^{(k,c)} - s_{n}^{(k,c)}\right\|}{\sum_{l=1}^{N} 1/\left\|v_{m}^{(k,c)} - s_{l}^{(k,c)}\right\|}, \quad 1 \leq m \leq M. $$(12)To satisfy the requirement \(\sum _{m=1}^{M}b_{n}^{(k,c)}(m)=1\), we scale the values by$$ b_{n}^{(k,c)}(m) \longleftarrow \frac{b_{n}^{(k,c)}(m)}{\sum_{l=1}^{M}b_{n}^{(k,c)}(l)}. $$(13)

Continuous HMM (CHMM): we assume that each state has N _{ g } Gaussian components. For each model \(\lambda _{k}^{(c)}\), as in the discrete case, we define a vector of observations, U ^{(k,c)}. First, U ^{(k,c)} is partitioned into N clusters and the center of cluster n is taken as state \(s_{n}^{(k,c)}\). Let \(\textbf {U}_{n}^{(k,c)}\) be the observations assigned to cluster n. Next, we partition \(\textbf {U}_{n}^{(k,c)}\) into N _{ g } clusters using the kmeans algorithm [27]. The mean of each component, \(\mu _{n}^{(k,c,g)}\), is the center of one of the resulting clusters, and the covariance, \(\Sigma _{n}^{(k,c,g)}\), is estimated using the observations that belong to that same cluster. If we denote by \(\textbf {U}_{n}^{(k,c,g)}\) the observations that belong to component g of state \(s_{n}^{(k,c)}\), the parameters of \(\lambda _{k}^{(c)}\) are computed using$$ \mu_{n}^{(k,c,g)} = \text{mean}\left\{\textbf{U}_{n}^{(k,c,g)}\right\}, 1 \leq n \leq N, 1 \leq g \leq N_{g}. $$(14)$$ \Sigma_{n}^{(k,c,g)} = \text{covariance} \left\{\textbf{U}_{n}^{(k,c,g)}\right\}, 1 \leq n \leq N, 1 \leq g \leq N_{g}. $$(15)
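The discrete-case emission initialization of Eqs. (12)–(13), i.e., inverse-distance memberships followed by row normalization, can be sketched as follows (toy 1-D states and codebook; assumes no symbol coincides exactly with a state representative):

```python
import numpy as np

def init_emissions(states, codebook):
    """Initialize b_n(m) from inverse distances between state representatives
    and codebook symbols (Eq. (12)), then rescale so each state's emission
    probabilities sum to one (Eq. (13))."""
    # states: (N, d), codebook: (M, d)
    dist = np.linalg.norm(states[:, None, :] - codebook[None, :, :], axis=2)
    inv = 1.0 / dist                           # assumes dist > 0 everywhere
    b = inv / inv.sum(axis=0)                  # Eq. (12): membership across states
    return b / b.sum(axis=1, keepdims=True)    # Eq. (13): rows sum to one

# toy 1-D example: N = 2 state representatives, M = 3 codebook symbols
states = np.array([[0.0], [1.0]])
codebook = np.array([[0.25], [0.75], [2.0]])
B = init_emissions(states, codebook)
print(B.sum(axis=1))  # -> [1. 1.]
```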
For both the discrete and continuous cases, any clustering algorithm, such as K-means [27] or the fuzzy C-means [28], could be used to identify the states, the codebook, or the multiple components. After initialization, we use one of the training schemes described earlier to update the parameters of \(\lambda _{k}^{(c)}\) using the respective observations \(\mathbb {O}_{k}^{(c)}\), k∈{1,⋯,K}, c∈{1,⋯,C}. As mentioned earlier, for homogeneous clusters, BW training results in one model λ ^{ B W } per cluster; for mixed clusters, MCE training results in C models per cluster, \(\lambda ^{MCE}_{c}\), c=1,…,C; and for small clusters, variational Bayesian learning results in one model per cluster, λ ^{ V B }. The output of the BW- and VB-trained cluster models is \(Pr(O \mid \lambda_k)\), while the output of the MCE-trained cluster models is \(\max _{c}{Pr\left (O \mid \lambda _{k,c}^{MCE}\right)}\).
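The continuous-case initialization of Eqs. (14)–(15), i.e., k-means within a state followed by per-cluster means and covariances, can be sketched as below. The toy data and the number of k-means iterations are illustrative:

```python
import numpy as np

def init_state_components(obs, n_g, rng):
    """Initialize the N_g Gaussian components of one state: plain k-means on
    the observations assigned to that state, then per-cluster mean and
    covariance (Eqs. (14)-(15))."""
    # obs: (n, d) observations assigned to this state
    centers = obs[rng.choice(len(obs), n_g, replace=False)]
    for _ in range(20):  # a few k-means iterations
        labels = np.linalg.norm(obs[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([obs[labels == g].mean(axis=0) if np.any(labels == g)
                            else centers[g] for g in range(n_g)])
    means = centers                                          # Eq. (14)
    covs = [np.cov(obs[labels == g].T) for g in range(n_g)]  # Eq. (15)
    weights = np.bincount(labels, minlength=n_g) / len(obs)  # mixing weights
    return means, covs, weights

# toy state observations: two well-separated groups -> N_g = 2 components
rng = np.random.default_rng(0)
obs = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
means, covs, weights = init_state_components(obs, 2, rng)
```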
4.4 Decision-level fusion
The partial confidence values of the different models need to be combined into a single confidence value. Let \(\Lambda =\left \{\lambda ^{BW}_{i},\lambda ^{MCE}_{j},\lambda ^{VB}_{k}\right \}\) be the resulting mixture model composed of a total of K models, K=K _{1}+K _{2}+K _{3}.
Let \(F(k,r)= \log Pr(O_{r} \mid \lambda_{k})\), 1≤r≤R, 1≤k≤K, be the log-likelihood matrix obtained by testing the R training sequences with the K models. Each column f _{ r } of matrix F represents the feature vector of sequence r in the decision space (recall that f _{ r } is a K-dimensional vector while O _{ r } is a sequence of vector observations of length T). In other words, each column holds the confidences assigned by the K models to sequence r. Therefore, the set of sequences \(\mathbb {O}=\{O_{r},y_{r}\}_{r=1}^{R}\) is mapped to a confidence space \(\{\textbf {f}_{r}, y_{r}\}_{r=1}^{R}\). Finally, a combination function, \(\mathbb {H}\), takes all the f _{ r }’s as input and outputs the final decision. The general framework for fusing the K outputs is highlighted in Algorithm ??.
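The fusion of one column of F into a final confidence can be sketched with a single-layer perceptron; the log-likelihoods and the weights below are hypothetical (in the paper, the weights are learned by backpropagation):

```python
import numpy as np

def fuse(F, w, b):
    """Decision-level fusion: map each K-dimensional confidence vector f_r
    (one column of F) to a final confidence in [0, 1] via a single-layer
    perceptron with weights w and bias b."""
    return 1.0 / (1.0 + np.exp(-(w @ F + b)))  # sigmoid of a weighted sum

# K = 2 models, R = 2 sequences: model 1 favors sequence 1, model 2 sequence 2
F = np.array([[-5.0, -30.0],
              [-28.0, -6.0]])
w = np.array([1.0, -1.0])  # hypothetical learned weights
out = fuse(F, w, 0.0)
```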
Several decision-level fusion techniques, such as simple algebraic combination [29], artificial neural networks (ANN) [30], and hierarchical mixtures of experts (HME) [31], can be used. In our work, we use a single-layer perceptron with no hidden layers. The ANN weights are learned from the labeled training data using the backpropagation algorithm [32].
The architecture of the proposed eHMM is summarized in Fig. 1. It is composed of four main components: similarity matrix computation, relational clustering, adaptive training, and decision-level fusion. To test a new sequence, the outputs of the different models are aggregated into a single confidence value using Algorithm ??.
5 Application to landmine detection using groundpenetrating radar data
5.1 Data collections
Raw GPR data needs to be preprocessed and prescreened. Preprocessing includes groundlevel alignment and signal and noise background removal. Prescreening is needed to focus attention and identify regions with subsurface anomalies. For this step, we use the adaptive least mean squares (LMS) prescreener [34]. The LMS flags locations of interest utilizing a computationally inexpensive algorithm so that more advanced algorithms can be applied only on the small subsets of data flagged by the prescreener.
Data collections

| Collection | Total prescreened alarms | Mine encounters | False alarms |
|---|---|---|---|
| \(\mathfrak {D_{1}}\) | 2477 | 732 | 1745 |
| \(\mathfrak {D_{2}}\) | 1343 | 724 | 619 |
| \(\mathfrak {D_{3}}\) | 1843 | 613 | 1230 |
5.2 Feature extraction
The goal of the feature extraction step is to transform original GPR data into a sequence of observation vectors. We use two types of features that have been proposed and used independently. Each feature represents a different interpretation of the raw data and aims at providing a good discrimination between mine and clutter signatures. These features are outlined in the following subsections.
5.2.1 EHD features
This feature is based on the edge histogram descriptor [9] (EHD) and characterizes edges in the spatial domain. The EHD captures the salient properties of the 3D alarms in a compact and translation-invariant representation. It extracts edge histograms capturing the frequency of occurrence of edge orientations in the data associated with a ground position. Simple edge detector operators are used to identify edges and group them into five categories: vertical, horizontal, diagonal, anti-diagonal, and isotropic (non-edge). Each B-scan position is then represented by a five-dimensional observation vector. Each dimension of this vector represents the percentage of pixels (in a small interval along the depth) that belong to each of the five edge categories.
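The idea can be sketched as below: classify each 2x2 block by its dominant edge operator response and normalize the counts into a five-bin histogram. The operator kernels and the non-edge threshold here are illustrative, not the paper's exact values:

```python
import numpy as np

def ehd_column(patch):
    """Five-bin edge histogram for one B-scan neighborhood: the fraction of
    2x2 blocks whose dominant response is vertical, horizontal, diagonal,
    anti-diagonal, or isotropic (non-edge). Kernels and threshold are
    illustrative."""
    kernels = [
        np.array([[1, -1], [1, -1]]),                   # vertical edge
        np.array([[1, 1], [-1, -1]]),                   # horizontal edge
        np.array([[np.sqrt(2), 0], [0, -np.sqrt(2)]]),  # diagonal edge
        np.array([[0, np.sqrt(2)], [-np.sqrt(2), 0]]),  # anti-diagonal edge
    ]
    hist = np.zeros(5)
    h, w = patch.shape
    for i in range(h - 1):
        for j in range(w - 1):
            block = patch[i:i + 2, j:j + 2]
            resp = [abs((k * block).sum()) for k in kernels]
            if max(resp) < 0.1:
                hist[4] += 1          # weak response -> isotropic/non-edge bin
            else:
                hist[int(np.argmax(resp))] += 1
    return hist / hist.sum()

# a patch with a single vertical edge -> all mass in the "vertical" bin
print(ehd_column(np.array([[0., 1.], [0., 1.], [0., 1.]])))  # -> [1. 0. 0. 0. 0.]
```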
5.2.2 Gabor features
Gabor features characterize edges in the frequency domain at multiple scales and orientations and are based on Gabor wavelets [7]. This feature is extracted by expanding the signature’s B-scan (depth vs. down-track) using a bank of scale- and orientation-selective Gabor filters. Expanding a signal using Gabor filters provides a localized frequency description. In our experiments, we use a bank of filters tuned to the combination of three scales and four orientations. Each observation is then represented by a 12-dimensional feature vector.
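A 3-scale by 4-orientation bank can be sketched as follows; the kernel parameterization is illustrative, since the paper does not give its exact filter settings:

```python
import numpy as np

def gabor_kernel(size, scale, theta):
    """Real-valued Gabor kernel at one scale and orientation (illustrative
    parameterization): a Gaussian envelope times an oriented cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)       # rotated coordinate
    env = np.exp(-(x**2 + y**2) / (2.0 * scale**2))  # Gaussian envelope
    return env * np.cos(2.0 * np.pi * xr / scale)    # oriented carrier

# bank of 3 scales x 4 orientations; filtering a B-scan patch with the bank
# yields one 12-dimensional observation vector per down-track position
bank = [gabor_kernel(9, s, t) for s in (2.0, 4.0, 8.0)
        for t in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
patch = np.ones((9, 9))                                   # toy B-scan patch
features = [float(abs((k * patch).sum())) for k in bank]  # magnitude responses
print(len(features))  # -> 12
```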
5.3 Ensemble HMM implementation and results
In all experiments reported in this paper, we use six-fold cross-validation for each data collection \(\mathfrak {D_{l}}\), l∈{1,2,3}. For each fold, a subset of the data (\(\mathfrak {D_{l}}_{\textit {Trn}}\)) is used for training and the remaining data (\(\mathfrak {D_{l}}_{\textit {Tst}}\)) are used for testing. \(\mathfrak {O_{l}}_{\textit {Trn}}^{Feat}\) denotes the set of observation sequences extracted from dataset \(\mathfrak {D_{l}}\) using one of the feature extraction methods, “Feat” (EHD or Gabor).
\({\lambda ^{M}_{6}}\): CHMM model parameters of cluster 6

| State | Component | H | V | D | A | N | Weight |
|---|---|---|---|---|---|---|---|
| s_1 | c_11 | 0.21 | 0.17 | 0.41 | 0.07 | 0.13 | g_11 = 0.30 |
| | c_12 | 0.36 | 0.12 | 0.25 | 0.11 | 0.17 | g_12 = 0.22 |
| | c_13 | 0.15 | 0.18 | 0.23 | 0.06 | 0.37 | g_13 = 0.49 |
| s_2 | c_21 | 0.42 | 0.09 | 0.25 | 0.12 | 0.13 | g_21 = 0.30 |
| | c_22 | 0.37 | 0.10 | 0.10 | 0.30 | 0.13 | g_22 = 0.20 |
| | c_23 | 0.59 | 0.05 | 0.10 | 0.12 | 0.14 | g_23 = 0.50 |
| s_3 | c_31 | 0.38 | 0.11 | 0.10 | 0.26 | 0.16 | g_31 = 0.21 |
| | c_32 | 0.20 | 0.17 | 0.07 | 0.43 | 0.13 | g_32 = 0.30 |
| | c_33 | 0.14 | 0.20 | 0.06 | 0.25 | 0.36 | g_33 = 0.49 |

Transition matrix A:

| | s_1 | s_2 | s_3 |
|---|---|---|---|
| s_1 | 0.73 | 0.27 | 0.00 |
| s_2 | 0.00 | 0.67 | 0.33 |
| s_3 | 0.00 | 0.00 | 1.00 |
\(\lambda ^{M}_{10}\): CHMM model parameters of cluster 10

Mixture means and weights:

State  Comp.    H     V     D     A     N    Weight
s_1    c_11   0.14  0.13  0.17  0.08  0.48  g_11 = 0.27
       c_12   0.26  0.11  0.20  0.06  0.37  g_12 = 0.40
       c_13   0.16  0.04  0.10  0.05  0.66  g_13 = 0.32
s_2    c_21   0.30  0.07  0.10  0.14  0.39  g_21 = 0.50
       c_22   0.48  0.05  0.11  0.14  0.22  g_22 = 0.28
       c_23   0.27  0.12  0.07  0.37  0.16  g_23 = 0.21
s_3    c_31   0.09  0.11  0.03  0.18  0.59  g_31 = 0.60
       c_32   0.22  0.17  0.05  0.36  0.20  g_32 = 0.04
       c_33   0.10  0.20  0.02  0.31  0.36  g_33 = 0.36

Transition matrix A:

        s_1   s_2   s_3
s_1    0.74  0.26  0.00
s_2    0.00  0.89  0.11
s_3    0.00  0.00  1.00
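To illustrate how such a mixture component is used at test time, the following sketch computes the log-likelihood of an observation sequence under a left-to-right CHMM with Gaussian-mixture emissions via the scaled forward algorithm. The shared spherical covariance `var` is a simplifying assumption made for this sketch; the actual models use covariances estimated during training.

```python
import numpy as np

def gmm_pdf(x, means, weights, var=0.05):
    """Mixture-of-Gaussians density; spherical shared covariance is assumed."""
    d = means.shape[1]
    diff = x[None, :] - means                              # (n_components, d)
    exponent = -0.5 * np.sum(diff**2, axis=1) / var
    norm = (2 * np.pi * var) ** (d / 2)
    return np.sum(weights * np.exp(exponent) / norm)

def forward_loglik(obs, A, pi, means, weights):
    """Log-likelihood of a sequence obs (T x d) via the scaled forward algorithm.
    means[s] and weights[s] parameterize the emission mixture of state s."""
    T, n_states = len(obs), A.shape[0]
    b = np.array([gmm_pdf(obs[0], means[s], weights[s]) for s in range(n_states)])
    alpha = pi * b                                         # initialization
    c = alpha.sum(); alpha /= c
    loglik = np.log(c)
    for t in range(1, T):                                  # induction with scaling
        b = np.array([gmm_pdf(obs[t], means[s], weights[s]) for s in range(n_states)])
        alpha = (alpha @ A) * b
        c = alpha.sum(); alpha /= c
        loglik += np.log(c)
    return loglik
```

In the eHMM, each test sequence is scored by every cluster model in this way, and the resulting vector of log-likelihoods is passed to the fusion stage.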
The main conclusion that we can draw from the above example is that \(\lambda ^{M}_{6}\) and \(\lambda ^{M}_{10}\) are very different and characterize two distinct subsets of the training data. The standard HMM approach, in contrast, would pool all alarms to learn a single model for mines (both weak and strong) and a single model for clutter.
In the final step, the outputs of the eHMM mixture models are combined using a single-layer ANN. The ANN parameters are trained so that the fused response of the eHMM mixture models matches the training data labels.
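A minimal sketch of such a fusion stage, assuming a single logistic unit trained by gradient ascent on the per-model log-likelihoods (the learning rate, epoch count, and function names here are illustrative, not the exact training configuration used in the experiments):

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))   # clipped for stability

def train_fusion_ann(loglik_matrix, labels, lr=0.1, epochs=500):
    """Single-layer ANN (one logistic unit) mapping the K per-model
    log-likelihoods of each alarm to a mine/clutter confidence.
    loglik_matrix: (n_alarms, K); labels: 0 = clutter, 1 = mine."""
    X = np.hstack([loglik_matrix, np.ones((len(loglik_matrix), 1))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = _sigmoid(X @ w)
        w += lr * X.T @ (labels - p) / len(X)   # gradient ascent on log-likelihood
    return w

def fusion_confidence(loglik_row, w):
    """Final confidence value for one alarm."""
    return _sigmoid(np.append(loglik_row, 1.0) @ w)
```

The learned weights play the role of the ANN parameters fit to the training labels; at test time, each alarm's K model scores are fused into one confidence value.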
5.4 eHMM vs. baseline HMM results
AUC of the ensemble HMM and baseline HMM classifiers

Dataset               Classifier       EHD    Gabor
\(\mathfrak {D_{1}}\)  Ensemble DHMM    712    719
                      Baseline DHMM    643    499
                      Ensemble CHMM    718    617
                      Baseline CHMM    614    472
\(\mathfrak {D_{2}}\)  Ensemble DHMM    402    127
                      Baseline DHMM    107     30
                      Ensemble CHMM    359    122
                      Baseline CHMM    209    102
\(\mathfrak {D_{3}}\)  Ensemble DHMM    343    296
                      Baseline DHMM    272    122
                      Ensemble CHMM    326    226
                      Baseline CHMM    284    140
6 Conclusions
In this work, we have proposed a novel ensemble HMM classification method that is based on clustering sequences in the log-likelihood space. The eHMM constructs multiple HMMs and fuses their outputs for final decision making. We hypothesized that the data are generated by multiple models; these models reflect the fact that samples from the same class can have different characteristics, resulting in large intra-class variability.
The eHMM, in its discrete and continuous versions, was implemented and evaluated using large collections of landmine GPR data. We examined the intermediate steps of the eHMM and compared its performance to the baseline HMM. Results on three GPR data collections show that the proposed method can identify meaningful and coherent HMM mixture models that describe different properties of the data. Each individual HMM characterizes a group of data that share common attributes. The experiments show that the eHMM's intermediate results are in line with the expected behavior. The results also indicate that, for both the continuous and discrete versions, the proposed method outperforms the baseline HMM that uses one model for each class in the data.
7 Endnote
^{1} The details of the landmine detection application using GPR signatures are presented in Section 5.
Declarations
Acknowledgements
This work was supported in part by U.S. Army Research Office Grant Numbers W911NF-13-1-0066 and W911NF-14-1-0589. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office, or the U.S. Government.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Landmine Monitor Report (2013). http://www.themonitor.org/.
2. TR Witten, in SPIE Conf. Detection and Remediation Technologies for Mines and Minelike Targets III. Present state of the art in ground-penetrating radars for mine detection (Orlando, FL, 1998), pp. 576–586.
3. PD Gader, H Frigui, BN Nelson, G Vaillette, JM Keller, in SPIE Conf. Detection and Remediation Technologies for Mines and Minelike Targets IV. New results in fuzzy set based detection of landmines with GPR (Orlando, FL, 1999), pp. 1075–1084.
4. PD Gader, B Nelson, H Frigui, G Vaillette, JM Keller, Fuzzy logic detection of landmines with ground penetrating radar. Signal Process., Special Issue on Fuzzy Logic in Signal Processing. 80, 1069–1084 (2000).
5. PD Gader, M Mystkowski, Y Zhao, Landmine detection with ground penetrating radar using hidden Markov models. IEEE Trans. Geosci. Remote Sens. 39, 1231–1244 (2001).
6. H Frigui, DKC Ho, PD Gader, Real-time landmine detection with ground-penetrating radar using discriminative and adaptive hidden Markov models. EURASIP J. Appl. Signal Process. 12, 1867–1885 (2005).
7. H Frigui, O Missaoui, PD Gader, in SPIE Conf. Detection and Remediation Technologies for Mines and Minelike Targets XII. Landmine detection using discrete hidden Markov models with Gabor features (Louisville, KY, USA, 2007).
8. H Frigui, PD Gader, S Kotturu, in SPIE Conf. Detection and Remediation Technologies for Mines and Minelike Targets. Detection and discrimination of landmines in ground penetrating radar using an eigenmine and fuzzy membership function approach (2004). doi:10.1109/TFUZZ.2008.2005249.
9. H Frigui, PD Gader, in Proceedings of the IEEE International Conference on Fuzzy Systems. Detection and discrimination of land mines based on edge histogram descriptors and fuzzy k-nearest neighbors (Vancouver, BC, Canada, 2006).
10. A Karem, A Fadeev, H Frigui, P Gader, in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 7664. Comparison of different classification algorithms for landmine detection using GPR (2010), p. 2. doi:10.1117/12.852257.
11. PA Torrione, KD Morton, R Sakaguchi, LM Collins, Histograms of oriented gradients for landmine detection in ground-penetrating radar data. IEEE Trans. Geosci. Remote Sens. 52(3), 1539–1550 (2014). doi:10.1109/TGRS.2013.2252016.
12. O Missaoui, H Frigui, P Gader, in 2010 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). Model level fusion of edge histogram descriptors and Gabor wavelets for landmine detection with ground penetrating radar (2010), pp. 3378–3381. doi:10.1109/IGARSS.2010.5650350.
13. O Missaoui, H Frigui, P Gader, in International Conference on Machine Learning and Applications (ICMLA '09). Discriminative multi-stream discrete hidden Markov models (2009), pp. 178–183. doi:10.1109/ICMLA.2009.121.
14. O Missaoui, H Frigui, P Gader, Multi-stream continuous hidden Markov models with application to landmine detection. EURASIP J. Adv. Signal Process. 1 (2013). doi:10.1186/1687-6180-2013-40.
15. CR Ratto, KD Morton, LM Collins, PA Torrione, in 2011 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). A hidden Markov context model for GPR-based landmine detection incorporating stick-breaking priors (2011), pp. 874–877. doi:10.1109/IGARSS.2011.6049270.
16. LR Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 77(2), 257–286 (1989).
17. S Theodoridis, K Koutroumbas, Pattern Recognition, 4th edn. (Academic Press, Orlando, FL, USA, 2009).
18. R Duda, P Hart, Pattern Classification and Scene Analysis (John Wiley and Sons, New York, 1973).
19. LE Baum, T Petrie, Statistical inference for probability functions of finite state Markov chains. Ann. Math. Stat. 37, 1554–1563 (1966).
20. J Paisley, L Carin, Hidden Markov models with stick-breaking priors. IEEE Trans. Signal Process. 57(10), 3905–3917 (2009). doi:10.1109/TSP.2009.2024987.
21. LE Baum, T Petrie, Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37(6), 1554–1563 (1966). doi:10.2307/2238772.
22. JH Ward, Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58(301), 236–244 (1963).
23. JH Holland, Adaptation in Natural and Artificial Systems (MIT Press, Cambridge, MA, USA, 1992).
24. S Kirkpatrick, CD Gelatt, MP Vecchi, Optimization by simulated annealing. Science. 220(4598), 671–680 (1983). doi:10.1126/science.220.4598.671.
25. BH Juang, W Chou, CH Lee, Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process. 5(3), 257–265 (1997).
26. DJC MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, New York, 2003).
27. AK Jain, RC Dubes, Algorithms for Clustering Data (Prentice Hall, Upper Saddle River, NJ, USA, 1988).
28. JC Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Kluwer Academic Publishers, Norwell, MA, USA, 1981).
29. G Fumera, F Roli, A theoretical and experimental analysis of linear combiners for multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 942–956 (2005). doi:10.1109/TPAMI.2005.109.
30. DS Lee, SN Srihari, in ICDAR '95: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1. A theory of classifier combination: the neural network approach (IEEE Computer Society, Washington, DC, USA, 1995), p. 42.
31. MI Jordan, RA Jacobs, in NIPS. Hierarchies of adaptive experts (Denver, 1991), pp. 985–992.
32. DE Rumelhart, GE Hinton, RJ Williams, Learning internal representations by error propagation, 318–362 (1986). ISBN 0-262-68053-X.
33. KJ Hintz, in Proceedings of the SPIE Conference on Detection and Remediation Technologies for Mines and Minelike Targets IX. SNR improvements in Niitek ground-penetrating radar (Orlando, FL, USA, 2004).
34. PA Torrione, CS Throckmorton, LM Collins, Performance of an adaptive feature-based processor for a wideband ground penetrating radar system. IEEE Trans. Aerosp. Electron. Syst. 42(2), 644–658 (2006).