
Image retrieval by information fusion based on scalable vocabulary tree and robust Hausdorff distance


In recent years, the Scalable Vocabulary Tree (SVT) has been shown to be effective in image retrieval. However, for general images where the foreground is the object to be recognized while the background is cluttered, the performance of the current SVT framework is restricted. In this paper, a new image retrieval framework that incorporates a robust distance metric and information fusion is proposed, which improves the retrieval performance relative to the baseline SVT approach. First, the influence of the visual words that represent the background is diminished by using a robust Hausdorff distance between different images. Second, image matching results based on three image signature representations are fused, which enhances the retrieval precision. We conducted extensive experiments on small-scale to large-scale image datasets: Corel-9, Corel-48, and PKU-198, where the proposed Hausdorff metric and information fusion outperform the state-of-the-art methods by about 13%, 15%, and 15%, respectively.

1 Introduction

Image retrieval is an important task in computer vision, which is particularly useful in applications such as Internet image identification for image classification, search and annotation. In recent years, a number of Internet image search systems have been developed [1, 2], which focused on learning a statistical model for mapping image content features to classification labels.

In recent years, a number of deep learning frameworks have been proposed for content-based image retrieval (CBIR). Back-propagation has been a well-known algorithm for learning the weights of neural networks since the 1980s [3, 4] and is widely used in deep learning networks. A typical deep learning approach consists of three phases: (i) training a deep learning model from training data with pre-defined labels; (ii) passing the images through the trained model to extract the feature representations; and (iii) applying the fully connected layers of the deep architecture, or other models such as K-nearest neighbor (KNN), to obtain the best-matching images. Specifically, for the first stage, several deep architectures of Convolutional Neural Networks (CNNs) can be applied [5–8]. Deep learning approaches for image retrieval have achieved the best performance in recent years. However, for a large collection of dataset images, the deep architecture can only be trained efficiently using powerful graphics processing units (GPUs). In contrast, the scalable vocabulary tree (SVT) framework has been shown to be computationally efficient [9] for large-scale image retrieval tasks on ordinary CPUs.

SVT approaches can be regarded as an extension of Bag-of-Words (BoW) approaches, since the visual words can be easily extended to tens of thousands at a logarithmic scale [10–13]. In a typical image retrieval framework [1], a scalable vocabulary tree (SVT) is generated by hierarchically clustering local descriptors. First, an initial k-means clustering is performed on the local descriptors of the training data to obtain the first-level k clusters. Then, the same clustering process is applied recursively to the local descriptors assigned to each cluster at the current level. Repeating this procedure leads to a hierarchical k-means clustering structure, also known as an SVT in this context. Based on the SVT, high-dimensional histograms of image local descriptors can be generated, which enables efficient image retrieval. When performing image matching on an SVT, local descriptors of the query image are quantized by traversing each layer of the SVT, and a histogram over the tree nodes (visual words) is generated. Candidate images are then sorted according to the similarity of their histograms to the query image histogram. A histogram of the local descriptors is called an image signature in this context.
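The recursive clustering and root-to-leaf quantization described above can be sketched in a few lines of pure Python. This is only an illustrative toy under stated assumptions: the helper names (`sq_dist`, `kmeans`, `build_tree`, `quantize`), the 2-D synthetic descriptors, and the choice to count only leaf paths (rather than every node on the path) are all hypothetical and are not taken from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means. Returns (centers, groups), one group of points per center."""
    centers = random.Random(seed).sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep the old center if the group is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
    return centers, groups

def build_tree(points, branch, depth):
    """Recursively cluster descriptors; each node stores its center and its children."""
    if depth == 0 or len(points) < branch:
        return []
    centers, groups = kmeans(points, branch)
    return [
        {"center": c, "children": build_tree(g, branch, depth - 1)}
        for c, g in zip(centers, groups)
    ]

def quantize(tree, descriptor):
    """Walk from the root to a leaf, choosing the nearest child at each level."""
    path, children = [], tree
    while children:
        idx = min(range(len(children)),
                  key=lambda i: sq_dist(descriptor, children[i]["center"]))
        path.append(idx)
        children = children[idx]["children"]
    return tuple(path)

# Toy data: 2-D "descriptors" scattered around four corners (an assumption for illustration).
rng = random.Random(1)
corners = [(0, 0), (0, 10), (10, 0), (10, 10)]
data = [[cx + rng.random(), cy + rng.random()] for cx, cy in corners for _ in range(20)]

tree = build_tree(data, branch=2, depth=2)

# Image signature: histogram of leaf paths over all descriptors of one "image".
signature = {}
for d in data:
    path = quantize(tree, d)
    signature[path] = signature.get(path, 0) + 1
print(signature)
```

Because each descriptor is compared with only `branch` centers per level, the lookup cost grows with the depth of the tree rather than with the total number of visual words.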

Although SVT-based approaches normally produce good results, there is still potential to further improve the performance. For example, the Hausdorff distance [14–16] is a popular post-processing approach used to match image signatures between the query image and database images. When the conventional Hausdorff distance metric [14] is used as the image feature matching technique, the maximum local distance is chosen as the matching distance, which yields a low distance only if every pair of image feature elements is close between a query image and a dataset image. However, it is sensitive to noise, i.e., the image local descriptors from the cluttered background. The limited improvement achieved relative to its extra computational cost makes it undesirable in large-scale image retrieval. In view of this, we propose to utilize an improved Hausdorff distance for the refinement of SVT matching results. In addition to the new distance metric, a new framework is proposed for large-scale image retrieval which can fuse matching results from different image representation methods. Each individual image representation adopted in the framework has the flexibility to utilize any one of the different image signatures based on SVT.

Specifically, two new algorithms are developed in the framework: (1) an improved Hausdorff distance algorithm which helps remove outliers and improve retrieval accuracy in the proposed framework; and (2) a fusion algorithm that combines the matching results generated from different image representation methods and produces the final matching list. To the best of our knowledge, this is the first work to incorporate the Hausdorff distance into SVT-based image retrieval systems.

Experimental results show that embedding and fusing different image representation methods in the proposed framework yields image retrieval performance superior to using each image representation method alone. Moreover, relative to a baseline approach using an SVT built upon SIFT descriptors, utilizing the Hausdorff distance improves retrieval performance by more than 10% with negligible computational cost.

2 Proposed image retrieval framework

We propose a new image retrieval framework that incorporates various sources of information, in contrast to conventional image retrieval based on an individual image representation, as shown in Fig. 1. Based on various local descriptors and visual word vocabularies, the different image representations produce different image search lists from the database images. Using the information fusion algorithm, these sets of search results are fused to produce the final matching list. The final image retrieval performance is expected to be superior to that produced by each individual search method, since it fuses different sources of information.

Fig. 1 Overview of the proposed image retrieval framework

In order to evaluate the effectiveness of the proposed fusion algorithm, several image representations are used here, as shown in Fig. 2. Specifically, three image search methods are used: (i) an SVT approach based on the histogram of dense-patch SIFT descriptors; (ii) an SVT approach based on the kernel density of dense-patch SIFT descriptors; and (iii) an SVT approach based on the histogram of dense-patch DAISY descriptors. The final image retrieval performance will be validated to be superior to that produced by any of the individual matched image lists.

Fig. 2 Proposed fusion framework embedded with the proposed image signatures

For an individual image representation, we propose a new image retrieval framework that incorporates the kernel density information and a robust Hausdorff distance metric, as shown in Fig. 3. In the offline training phase, kernel density, which is optional, is incorporated in two stages: vocabulary tree construction and image histogram representation; in the online recognition phase for the query image, the kernel density is only involved in the histogram calculation. To preserve the efficiency of the retrieval system, the offline training phase is kept the same as in SVT image retrieval, while in the online retrieval phase for the query image, the Hausdorff distance is only involved in the refinement of the SVT image retrieval results.

Fig. 3 Proposed framework for image retrieval based on individual image representations

3 Proposed algorithms

3.1 Feature extraction

Amongst the local features, the scale-invariant feature transform (SIFT) has been the most widely used in recent years, and some variants of SIFT have also been proposed. The keypoint-based SIFT descriptor developed in [17, 18] is one of the most popular local features [19–21]. According to the comparative study in [22], SIFT is generally superior to other descriptors, such as moment invariants [23] and shape context [24], due to its robustness to affine transformations and illumination changes.

During SIFT descriptor extraction, the difference of Gaussian (DoG) space of the image is calculated first. Then, the keypoints are detected by finding those that are invariant to different scales in the DoG space. Based on the local region surrounding each detected keypoint, the orientation histogram is computed as the local descriptor. SIFT descriptors are relatively robust to image noise and illumination changes, as well as to limited changes in the viewing angle of the object.

To reduce computational cost, SIFT descriptors can also be extracted from regular patches rather than from detected interest keypoints, which is usually called dense-patch SIFT or dense-sampling SIFT. Since the detected SIFT keypoints are robust to scale and rotation, sparse SIFT descriptors are commonly used in general object matching. However, it has been reported that dense SIFT outperforms sparse SIFT in some applications [25–30]. According to the survey in [31], SIFT descriptors based on dense overlapping regular image patches are promising among state-of-the-art image feature extraction methods. According to the preliminary experiments in this work, sparse SIFT descriptors produce about 16% lower precision than dense-patch SIFT descriptors, which is consistent with the experimental results in [31]. Therefore, we only extract dense-patch descriptors for the image retrieval task.

DAISY is another state-of-the-art image descriptor [32, 33], which needs only about one tenth of the computational operations of SIFT descriptors [33]. DAISY is essentially similar to SIFT, except that it uses a Gaussian kernel to aggregate the gradient histograms in different bins, whereas SIFT relies on a triangular-shaped kernel. The performance of dense-patch DAISY descriptors is comparable to that of dense-patch SIFT descriptors. However, SIFT has been found to still outperform DAISY in some application domains [34].

3.2 Image representation

Towards large-scale image retrieval, the Scalable Vocabulary Tree (SVT) model [1] is well exploited in state-of-the-art works [1, 2, 9, 28, 35]. An SVT \(T\) is constructed by hierarchical clustering of local descriptors and consists of \(C=B^{L}\) codewords, where \(B\) is the branch factor and \(L\) is the depth. Let each node \(v^{l,h}\in T\) in the tree represent a visual codeword, where \(l\in\{0,1,\ldots,L\}\) indicates its level and \(h\in\{1,2,\ldots,B^{L-l}\}\) is the index of the node within its level. A query image \(q\) is represented by a bag of \(N_{q}\) local descriptors \(X_{q}=\{\mathbf{x}_{q,i}\},\ i\in N_{q}\). Each \(\mathbf{x}_{q,i}\) is traversed in \(T\) from the root to a leaf to find the nearest node, resulting in the quantization \(T(\mathbf {x}_{i})=\left \{ \mathbf {v}_{i}^{l,h}\right \}_{l=1}^{L}\). Thus, a query image is eventually represented by a high-dimensional sparse Bag-of-Words (BoW) histogram \(\mathbf {H}_{q}=[h_{1}^{q},\ldots,h_{C}^{q}]\) obtained by counting the occurrences of its descriptors on each node of \(T\).

After clustering the SIFT descriptors of training image patches by SVT, we obtain the codewords which are the cluster centers. For each codeword c in the codebook \(\mathcal {CB}\), traditional codebook model estimates the distribution of codewords in an image by a histogram as follows:

$$ \mathrm{H}\left(\mathbf{c}\right)=\frac{1}{N}\sum\limits_{i=1}^{N}\mathrm{I}\left(\mathbf{c}=\mathop{\text{arg}\;\text{min}}\limits_{\mathbf{w}}\left(D\left(\mathbf{w},\mathbf{v}_{i}\right)\right)\right)_{\mathbf{w}\in\mathcal{CB}} $$

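where \(N\) is the number of patches in an image, \(\mathbf{v}_{i}\) is the descriptor of the \(i\)-th image patch, \(D(\cdot,\cdot)\) is the Euclidean distance, and \(\mathrm{I}(\cdot)\) is the indicator function, which equals 1 when its argument is true and 0 otherwise.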
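As a concrete reading of the hard-assignment histogram in Eq. (1), the following minimal sketch counts, for each codeword, the fraction of descriptors whose nearest codeword it is. The function name `bow_histogram` and the tiny 2-D codebook are assumptions made for illustration only.

```python
import math

def bow_histogram(descriptors, codebook):
    """H(c): fraction of descriptors whose nearest codeword (Euclidean distance) is c."""
    counts = [0] * len(codebook)
    for v in descriptors:
        nearest = min(range(len(codebook)), key=lambda j: math.dist(v, codebook[j]))
        counts[nearest] += 1
    n = len(descriptors)
    return [c / n for c in counts]

# Hypothetical 2-D codebook and patch descriptors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

hist = bow_histogram(descriptors, codebook)
print(hist)  # [0.25, 0.5, 0.25]
```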

A robust alternative to histograms for estimating a probability density function is kernel density estimation (KDE) [36]. KDE uses a kernel function to smooth the local neighborhood of data samples, and it has several advantages over the histogram. First, its nonparametric nature provides enough flexibility to model feature distributions for a broad and diverse set of scenes. Second, in contrast to the histogram estimator, its smoothing parameter can be adjusted to make the descriptors relatively insensitive to small descriptor variations and to imperfections in scale normalization. Third, descriptors can still be computed very efficiently when KDE is coupled with the fast Gauss transform (FGT) [37]. A high-dimensional estimator with kernel K and bandwidth parameter B is given by

$$ f\left(\mathbf{c}\right)=\frac{1}{N}\sum\limits_{i=1}^{N}\mathrm{K}_{\mathbf{B}}\left(\mathbf{v}_{i}-\mathbf{c}\right) $$

In this paper, we use the SIFT descriptor, which draws on the Euclidean distance as its distance function. Since the Euclidean distance that we measure between SIFT descriptors assumes a Gaussian distribution, a Gaussian-shaped kernel is adopted here:

$$ \mathrm{K}_{\mathbf{B}}\left(\mathbf{x}\right)=\frac{1}{\left(2\pi\right)^{\frac{m}{2}}\left|\mathbf{B}\right|^{\frac{1}{2}}}exp\left(-\frac{1}{2}\mathbf{x}^{T}\mathbf{B}^{-1}\mathbf{x}\right) $$

where m is the dimensionality of the descriptor, and the bandwidth parameter matrix \(\mathbf {B}\in \mathcal {R}^{m{\times }m}\) models the degree of uncertainty about the sources and controls the smoothing behavior of the KDE.
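To make Eqs. (2) and (3) concrete, here is a minimal sketch of the kernel density of a codeword under a Gaussian kernel. It assumes an isotropic bandwidth matrix \(\mathbf{B}=\sigma^{2}\mathbf{I}\) (so that \(|\mathbf{B}|^{1/2}=\sigma^{m}\)); the function names, the value of `sigma2`, and the toy data are hypothetical.

```python
import math

def gaussian_kernel(x, sigma2):
    """K_B(x) for the isotropic bandwidth B = sigma2 * I (an assumption of this sketch)."""
    m = len(x)
    norm = (2 * math.pi) ** (m / 2) * sigma2 ** (m / 2)  # (2*pi)^(m/2) * |B|^(1/2)
    return math.exp(-0.5 * sum(xi * xi for xi in x) / sigma2) / norm

def kernel_density(codeword, descriptors, sigma2):
    """f(c) = (1/N) * sum_i K_B(v_i - c)."""
    total = sum(
        gaussian_kernel([vi - ci for vi, ci in zip(v, codeword)], sigma2)
        for v in descriptors
    )
    return total / len(descriptors)

# Toy example: three descriptors near the first codeword, one on the second, none near the third.
descriptors = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1), (3.0, 3.0)]
codebook = [(0.0, 0.0), (3.0, 3.0), (10.0, 10.0)]
signature = [kernel_density(c, descriptors, sigma2=0.25) for c in codebook]
print(signature)
```

Unlike the hard-assignment histogram, every descriptor contributes a (possibly tiny) weight to every codeword, so small shifts of a descriptor change the signature smoothly.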

We use 10-fold cross-validation to determine the optimal parameters. Hence, the size of the kernel depends on the dataset and the image descriptor. In order to reduce the computational cost, we simplify B by setting all of its diagonal elements to the same value and all of its off-diagonal elements to zero. We split the training set into 10 parts of roughly equal size. For each parameter setting, we fit the parameters using nine parts and calculate the retrieval precision on the remaining part, which serves as the validation set. We repeat this procedure so that every part is used as the validation set in one of the 10 runs. Finally, we average the 10 precisions obtained for that setting, and we choose the setting with the maximum average precision.
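The cross-validation loop just described can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for "fit on nine parts, measure retrieval precision on the tenth"; here it is replaced by a toy function so that the sketch runs on its own, and the candidate bandwidth values are made up.

```python
def k_fold_indices(n_items, k=10):
    """Split indices 0..n_items-1 into k folds of roughly equal size."""
    return [list(range(i, n_items, k)) for i in range(k)]

def cross_validate(n_items, candidates, evaluate, k=10):
    """Return (best_setting, best_mean_precision) over the candidate settings.

    evaluate(setting, train_idx, val_idx) is a hypothetical callback returning
    the retrieval precision on val_idx after fitting on train_idx.
    """
    folds = k_fold_indices(n_items, k)
    best_setting, best_score = None, float("-inf")
    for setting in candidates:
        scores = []
        for i, val_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(evaluate(setting, train_idx, val_idx))
        mean = sum(scores) / k
        if mean > best_score:
            best_setting, best_score = setting, mean
    return best_setting, best_score

# Toy stand-in for the real evaluation: precision peaks at a bandwidth of 0.5.
def toy_evaluate(setting, train_idx, val_idx):
    return 1.0 - abs(setting - 0.5)

folds = k_fold_indices(25)
best, score = cross_validate(25, [0.1, 0.5, 1.0], toy_evaluate)
print(best, score)
```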

We first construct the SVT by hierarchical clustering of local descriptors, i.e., SIFT descriptors. The SVT \(T\) consists of \(C=B^{L}\) codewords, where \(B\) is the branch factor and \(L\) is the depth. Then, a query image \(q\) is represented by a bag of \(N_{q}\) local descriptors \(X_{q}=\{\mathbf{x}_{q,i}\},\ i\in N_{q}\). Each \(\mathbf{x}_{q,i}\) is evaluated against all the leaves of \(T\) to find the kernel density \(f(\mathbf{c})\) for each leaf, resulting in the kernel density vector \(F(\mathbf {x}_{i})=\left \{ f\left (\mathbf {c}_{j}\right)\right \}_{j=1}^{B^{L}}\). Thus, a query image is eventually represented by a high-dimensional real-valued Bag-of-Words (BoW) histogram \(\mathbf {F}_{q}=[f_{q}\left (\mathbf {c}_{1}\right),f_{q}\left (\mathbf {c}_{2}\right),\ldots,f_{q}\left (\mathbf {c}_{B^{L}}\right)]\) obtained by calculating the kernel density on each leaf node of \(T\). The kernel density descriptors of the database images are denoted by \(\left \{ \mathbf {F}_{d_{m}}\right \}_{m=1}^{M}\), where \(M\) is the total number of database images.

Based on the kernel density of dense-patch SIFT and on dense-patch DAISY, we have two sets of image signatures (representations): \(\left \{ \mathbf {F}_{q},\left \{ \mathbf {F}_{d_{m}}\right \}_{m=1}^{M}\right \} \) and \(\left \{ \mathbf {G}_{q},\left \{ \mathbf {G}_{d_{m}}\right \}_{m=1}^{M}\right \} \), respectively. In addition, the basic image signature (BoW histogram) based on dense-patch SIFT is denoted as \(\left \{ \mathbf {H}_{q},\left \{ \mathbf {H}_{d_{m}}\right \}_{m=1}^{M}\right \} \). The image signatures \(\mathbf{H}_{q}\), \(\mathbf{F}_{q}\), and \(\mathbf{G}_{q}\) are \(B^{L}\)-dimensional vectors, and typically \(B^{L}\) ranges from \(10^{4}\) to \(10^{6}\), as suggested in [9]. We will show how the three types of image representations can be fused to achieve better image retrieval performance than any single image representation.

3.3 An improved Hausdorff metric for image matching

Denote by \(X\) and \(Y\) two sets of vectors: \(X=\{x_{i}\},\ i=1,2,\ldots,M\), and \(Y=\{y_{i}\},\ i=1,2,\ldots,N\), where \(x_{i}\) and \(y_{i}\) are both \(D\)-dimensional vectors. Then, the Hausdorff distance can be defined as a root mean square distance as follows:

$$ d_{\text{root}}(X,Y)=\sqrt{\frac{1}{|X|}\int\nolimits_{x\in X}d\left(x,Y\right)^{2}dX} $$

Conventionally, in order to reduce the computational complexity, the Hausdorff distance \(d_{0}(X,Y)\) is defined by

$$ d_{0}(X,Y)=sup\{\,\sup_{x\in X}\inf_{y\in Y}d(x,y),\,\sup_{y\in Y}\inf_{x\in X}d(x,y)\,\} $$

where sup represents the supremum and inf the infimum.

After the image representation, each image \(I\) can be represented by a \(B^{L}\)-dimensional signature in the real-valued domain \(R^{B^{L}}\). Let the signatures of the query image and a database image be \(X\) and \(Y\), respectively. Then, the Hausdorff distance between \(X\) and \(Y\) can be regarded as the measure for image retrieval. First, we define the distance \(d(x,Y)\) between a point \(x\) belonging to the set \(X\) and the set \(Y\), and then the Hausdorff distance \(d(X,Y)\) between the two sets, as:

$${} d\left(x,Y\right)=\min_{y\in Y}\left\Vert x-y\right\Vert_{2} $$

$${} d\left(X,Y\right)=\max\left\{ \,\max_{x\in X}\min_{y\in Y}d(x,y),\,\max_{y\in Y}\min_{x\in X}d(x,y)\,\right\} $$

Informally, two sets are close in the Hausdorff distance if every point of either set is close to some point of the other set. The Hausdorff distance is the greatest of all the distances from a point in one set to the closest point in the other set. However, this basic metric is not robust, since a few outliers will affect the result of the max operation. For example, if only one point \(y\in Y\) is far away from all other points, which are otherwise similar in \(X\) and \(Y\), then \(d(X,Y)\) will be large. This is likely to happen when a few visual words of the SVT originate from the cluttered background, and the local SIFT descriptors of the query image have high occurrences over these visual words, resulting in high Hausdorff distances to the database images which have the same foreground objects but different backgrounds.
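The sensitivity of the max-min definition to a single outlier can be checked numerically. The sketch below uses small 2-D point sets purely for illustration (the paper applies the distance to image signatures); the function names and data are assumptions.

```python
import math

def point_to_set(x, Y):
    """d(x, Y): distance from x to its nearest neighbor in Y."""
    return min(math.dist(x, y) for y in Y)

def hausdorff(X, Y):
    """Classical max-min Hausdorff distance between two finite point sets."""
    forward = max(point_to_set(x, Y) for x in X)
    backward = max(point_to_set(y, X) for y in Y)
    return max(forward, backward)

X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Y_clean = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
Y_noisy = Y_clean + [(2.0, 5.0)]  # one outlier, e.g. a background visual word

print(hausdorff(X, Y_clean))  # close to 0.1: every point has a near neighbor
print(hausdorff(X, Y_noisy))  # 5.0: the single outlier dominates the distance
```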

In order to diminish the effect of the outliers, the directed distance \(d_{\mathrm{H}}(X,Y)\) of the proposed Hausdorff distance is obtained by replacing the Euclidean distance with a cost function:

$$ d_{\mathrm{H}}(X,Y)=\frac{1}{|X|}\sum_{x\in X}\gamma\left(d\left(x,Y\right)\right) $$

where |X| is the cardinality of the image signature X, and the cost function γ(t) is convex and symmetric and has a unique minimum value at zero. In our experiments, we use the cost function defined by

$$ \frac{d\gamma(t)}{dt}=k\cdot\gamma(t)\left(1-\frac{\gamma(t)}{\tau}\right),\gamma(0)=\gamma_{0} $$

By the above definition, when the distance \(d(x,Y)\) is small, \(\gamma(d(x,Y))\) is very small, diminishing the effect of random noise; when the distance \(d(x,Y)\) is larger, \(\gamma(d(x,Y))\) becomes large, reflecting the actual distances between the points; when the distance \(d(x,Y)\) is very large, \(\gamma(d(x,Y))\) is limited by a threshold, controlling the effect of the outliers that might otherwise dramatically increase the distance.

By solving the differential Eq. (9), we get

$$ \gamma(t)=\frac{\tau}{1+\left(\frac{\tau}{\gamma_{0}}-1\right)exp\left(-kt\right)} $$

where \(0<\gamma_{0}<1\) is set to 0.1 experimentally, \(k=0.05\), and \(\tau=0.8\) is a threshold to eliminate outliers, so the outliers yielding large distances are diminished. Since the matching performance depends on the parameter \(\tau\), it is important to determine it appropriately. If it is set to infinity, the proposed Hausdorff distance is equivalent to the conventional one. Because the cost function is associated with the distance value \(d(x,Y)\), the threshold value is selected experimentally. The function \(\gamma(t)\) is illustrated in Fig. 4.
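A minimal numerical sketch of the logistic cost in Eq. (10) and the directed distance in Eq. (8) follows, using the parameter values quoted above (\(\gamma_{0}=0.1\), \(k=0.05\), \(\tau=0.8\)). The function names and the 2-D toy point sets are assumptions for illustration; the paper applies the distance to image signatures.

```python
import math

def gamma(t, tau=0.8, gamma0=0.1, k=0.05):
    """Logistic cost: gamma(0) = gamma0, and gamma(t) saturates at tau for large t."""
    return tau / (1.0 + (tau / gamma0 - 1.0) * math.exp(-k * t))

def robust_directed(X, Y):
    """d_H(X, Y): mean over x in X of gamma(d(x, Y))."""
    return sum(gamma(min(math.dist(x, y) for y in Y)) for x in X) / len(X)

X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Y_clean = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
Y_noisy = Y_clean + [(2.0, 1000.0)]  # one extreme outlier

print(gamma(0.0), gamma(1000.0))    # starts at gamma0, saturates near tau
print(robust_directed(Y_clean, X))  # small: all points have near neighbors
print(robust_directed(Y_noisy, X))  # the outlier contributes at most tau to the mean
```

Because each term of the mean is bounded by \(\tau\), a single far-away point can shift the directed distance by at most \(\tau/|X|\), in contrast to the max-based definition.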

Fig. 4 Components of the calculation of the Hausdorff distance between the green line X and the blue line Y

3.4 Image match scoring

For the purpose of image retrieval, image descriptors need to be indexed by similarity scoring. The database images are denoted by \(\left \{ d^{m}\right \}_{m=1}^{M}\), where \(d^{m}\) is the \(m\)-th database image and \(M\) is the number of images in the database. Following the same vocabulary tree quantization procedure, the local descriptors \(\{y_{j}\}\) in \(d^{m}\) are mapped to a high-dimensional sparse Bag-of-Words (BoW) histogram \(\mathbf {H}_{d}=[h_{1}^{d},\ldots,h_{C}^{d}]\). The images with the highest similarity score \(sim(q,d^{m})\) between the query \(q\) and the database image \(d^{m}\) are returned as the retrieval result. Conventionally, the similarity \(sim(q,d^{m})\) is defined in [1] as

$$ sim\left({\mathbf{H}_{q},{\mathbf{H}_{d^{m}}}}\right)=1-\left\Vert {\frac{{\mathbf{\mathbf{H}_{q}}\cdot\mathbf{{w}}}}{{\left\Vert {\mathbf{H}_{q}\cdot\mathbf{{w}}}\right\Vert }}-\frac{{{\mathbf{H}_{d^{m}}}\cdot\mathbf{{w}}}}{{\left\Vert {{\mathbf{H}_{d^{m}}}\cdot\mathbf{{w}}}\right\Vert }}}\right\Vert $$

where \(\mathbf{w}=[w_{1},w_{2},\ldots,w_{C}]\), \({w_{i}}=\ln \frac {M}{{M_{i}}}\), and \(M_{i}\) is the number of images in the database with at least one descriptor vector path through node \(i\).
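The weighted similarity in Eq. (11) can be sketched as below. Two points are assumptions of this sketch rather than statements from the paper: the product \(\mathbf{H}\cdot\mathbf{w}\) is read as an element-wise product, and the norm is taken to be the Euclidean norm. Nodes that no database image visits get weight 0 here to avoid a division by zero. All names and the toy histograms are hypothetical.

```python
import math

def idf_weights(database_hists):
    """w_i = ln(M / M_i), with M_i the number of database images touching node i."""
    M = len(database_hists)
    C = len(database_hists[0])
    weights = []
    for i in range(C):
        M_i = sum(1 for h in database_hists if h[i] > 0)
        weights.append(math.log(M / M_i) if M_i else 0.0)
    return weights

def normalize(vec):
    """Scale a vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def similarity(h_q, h_d, w):
    """sim = 1 - || normalized(H_q * w) - normalized(H_d * w) ||."""
    q = normalize([h * wi for h, wi in zip(h_q, w)])
    d = normalize([h * wi for h, wi in zip(h_d, w)])
    return 1.0 - math.sqrt(sum((a - b) ** 2 for a, b in zip(q, d)))

# Toy database of three images over a four-word vocabulary.
db = [[2, 0, 1, 0], [0, 3, 1, 0], [1, 0, 1, 4]]
w = idf_weights(db)
query = [3, 0, 2, 0]
scores = [similarity(query, h, w) for h in db]
print(w)
print(scores)
```

Note that the third word occurs in every database image, so its weight is \(\ln 1 = 0\) and it does not influence the ranking.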

Since the vocabulary tree is very large, the number of images whose descriptors are quantized to a particular node is normally zero. In [1], the scalability is addressed by only comparing those database images indexed by each non-zero codeword for the given query image. Here, \( sim\left ({\mathbf {H}_{q},{\mathbf {H}_{d^{m}}}}\right),\,m=1,2,\ldots,M\) is normalized with a Sigmoid function to enhance the larger similarity scores:

$$ p_{m}=\frac{1}{1+e^{-\alpha {sim}_{m}}} $$

where \(\alpha\) is the scaling parameter, which is set to 10 because this value produces the best performance in preliminary experiments. Here, we use 10-fold cross-validation to determine the optimal parameters. We split the training set into 10 parts of roughly equal size, and the parameter setting that yields the maximum average precision is chosen.
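As a small illustration of Eq. (12), the sketch below maps raw similarity scores to \((0,1)\) with \(\alpha=10\). The function name and the example scores are made up for illustration.

```python
import math

def sigmoid_scores(sims, alpha=10.0):
    """p_m = 1 / (1 + exp(-alpha * sim_m)) for each similarity score."""
    return [1.0 / (1.0 + math.exp(-alpha * s)) for s in sims]

sims = [0.9, 0.5, 0.0, -0.4]   # hypothetical similarity scores
p = sigmoid_scores(sims)
print(p)
```

The mapping is monotone, so the ranking of the database images is unchanged; the steep slope around zero mainly separates clearly similar from clearly dissimilar candidates before fusion.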

3.5 Proposed information fusion

Let the node ensemble contain SVT nodes \(T_{i},\ i=1,2,\ldots,N\), where \(N\) is the total number of SVT nodes, e.g., \(N=B^{L}\). By applying the SVT approach to the query image, \(M\) similarity scores are obtained for the query image. An example of tag fusion of the three density score lists is illustrated in Fig. 5, where \(s_{1}\), \(s_{2}\), and \(s_{3}\) are the three similarity score lists generated by the three types of image representation, and \(s\) is the final fused score list. After the final score list \(s\) is obtained, it is sorted in descending order and the top \(m\) (\(m<M\)) nodes are suggested for the query image.

Fig. 5 Overview of the tag fusion process

To integrate the three lists of scores, Dempster’s rule of combination [38] is utilized to combine the different sources, because it is considered to be a more flexible and general approach than traditional probability theory and it is able to deal with some ignorance of the system. The basic probability assignment (BPA) function is used here to take into account all the available evidence. It is defined as a mapping \(S\) from the power set \(2^{\Omega}\) of a finite set \(\Omega=\{A_{1},A_{2},\ldots,A_{N}\}\) to \([0,1]\) such that, for any \(T\in 2^{\Omega}\), we have

$$ \sum_{T\in2^{\Omega}}S(T)=1,\,S(T)\ge0 $$

where the power set \(2^{\Omega}\), built from the exhaustive set \(\Omega\) of mutually exclusive elements, is:

$${} \begin{aligned} 2^{\Omega}&={\Huge\{}\left\{ {A_{1}}\right\},\cdots,\left\{ {A_{N}}\right\},\left\{ {{A_{1}},{A_{2}}}\right\},\cdots,\left\{ {{A_{N-1}},{A_{N}}}\right\},\cdots,\\ &\quad \left\{ {{A_{1}},\cdots,{A_{N}}}\right\},\phi{\Huge\}} \end{aligned} $$

where \(\phi\) is the empty set, \(S(\phi)=0\) under a closed-world assumption, and there are in total \(2^{N}\) elements in \(2^{\Omega}\).

Dempster’s rule for combining K sources is:

$$ S(T)=\frac{\sum_{T_{1},T_{2},\cdots,T_{K}\subset 2^{\Omega},\ \cap_{i=1}^{K}T_{i}=T}\left(S_{1}(T_{1})\cdots S_{K}(T_{K})\right)}{\sum_{T_{1},T_{2},\cdots,T_{K}\subset 2^{\Omega},\ \cap_{i=1}^{K}T_{i}\neq\phi}\left(S_{1}(T_{1})\cdots S_{K}(T_{K})\right)} $$

In this work, the joint BPA for three sources is reformulated as follows:

$$ S(T)=\sum_{T_{1},T_{2},T_{3}{{\subset}}2^{\Omega},T_{1}{{\cap}}T_{2}{{\cap}}T_{3}=T}\frac{S_{1}\left(T_{1}\right)S_{2}\left(T_{2}\right)S_{3}\left(T_{3}\right)}{1-M} $$

where \(M=\sum _{T_{1}{{\cap }}T_{2}{{\cap }}T_{3}=\phi }\left (S_{1}\left (T_{1}\right)S_{2}\left (T_{2}\right)S_{3}\left (T_{3}\right)\right)\) is a measure of the amount of conflict among the three BPA sets.
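For a small frame of discernment, the three-source combination with the conflict term \(M\) can be computed by brute force. The sketch below is a generic implementation of Dempster's rule, not the reduced procedure developed in this paper; the function name, the three candidate labels, and all mass values are hypothetical.

```python
from itertools import product

def dempster_combine(*sources):
    """Combine BPAs (dicts mapping frozenset -> mass) with Dempster's rule.

    Returns (combined_bpa, conflict), where conflict is the total mass of all
    combinations whose focal sets have an empty intersection.
    """
    combined, conflict = {}, 0.0
    for combo in product(*(s.items() for s in sources)):
        inter = frozenset.intersection(*(focal for focal, _ in combo))
        mass = 1.0
        for _, m in combo:
            mass *= m
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mass
        else:
            conflict += mass
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {t: m / (1.0 - conflict) for t, m in combined.items()}, conflict

# Three hypothetical sources over candidates A1, A2, A3; OMEGA carries the "ignorance" mass.
A1, A2, A3 = frozenset({"A1"}), frozenset({"A2"}), frozenset({"A3"})
OMEGA = A1 | A2 | A3
s1 = {A1: 0.60, A2: 0.25, OMEGA: 0.15}
s2 = {A1: 0.50, A3: 0.35, OMEGA: 0.15}
s3 = {A1: 0.40, A2: 0.45, OMEGA: 0.15}

fused, conflict = dempster_combine(s1, s2, s3)
best = max(fused, key=fused.get)
print(sorted(best), round(conflict, 6))
```

The loop enumerates every combination of focal sets, which is exactly the cost that becomes prohibitive when each source assigns mass over a large power set.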

In order to satisfy the fast-response-time requirement of image retrieval, instead of directly using Dempster’s rule of combination, we improve the online procedure. Originally, the BPAs need to be estimated on as many as \(2^{N}\) elements in the power set \(2^{\Omega}\), and the computational complexity is as high as \(O(2^{N})\), which is not affordable in real applications. Here, we reduce the original power set \(2^{\Omega}\) to a much smaller subset

$$ P=\left\{ {\left\{ {A_{1}}\right\},\cdots,\left\{ {A_{N}}\right\},\left\{ {{A_{1}},{A_{2}}}\right\},\cdots,\left\{ {{A_{{N}-1}},{A_{{N}}}}\right\},\Phi}\right\} $$

where \(\Phi\) is a subset of \(2^{\Omega}\) which contains the elements with more than two SVT nodes. The reduced set \(P\) has \(N^{2}+1\) elements, so the computational complexity is \(O(N^{2})\). The computational cost is reduced substantially, since \(N=B^{L}\) here. The joint BPA can therefore be simplified as follows:

$$ S(T)=\sum_{T_{1},T_{2},T_{3}{{\subset}}P,T_{1}{{\cap}}T_{2}{{\cap}}T_{3}=T}\frac{S_{1}\left(T_{1}\right)S_{2}\left(T_{2}\right)S_{3}\left(T_{3}\right)}{1-M} $$

where \(M=\sum _{T_{1}{{\cap }}T_{2}{{\cap }}T_{3}=\phi }\left (S_{1}\left (T_{1}\right)S_{2}\left (T_{2}\right)S_{3}\left (T_{3}\right)\right)\) and the BPAs on each element of P for each source can be formulated as follows:

$$ S_{t}\left(A_{i}\right)=S_{t}\left(\left\{ A_{i},A_{i},A_{i}\right\} \right)=m_{t}\left(\left\{ A_{i}\right\} \right)^{3} $$
$$ S_{t}\left(\left\{ A_{i},A_{j},A_{k}\right\} \right)=m_{t}\left(\left\{ A_{i}\right\} \right)m_{t}\left(\left\{ A_{j}\right\} \right)m_{t}\left(\left\{ A_{k}\right\} \right) $$

where \(A_{i}\in\Omega\), \(t=1,2,3\), and \(S(\Phi)\), which denotes the ignorance on the power set, is set to 0.15 empirically.

For any \(T_{t}\in P\), \(k=1,2\), we can estimate \(S_{1}(T_{t})\), \(S_{2}(T_{t})\), and \(S_{3}(T_{t})\) according to (19) and (20), and then calculate \(S(A_{i})\) for every \(A_{i}\in\Omega\) according to (18). Finally, we sort \(S(A_{i})\), \(A_{i}\in\Omega\), in descending order and take the top-ranked database image as the search result.

3.6 Proposed image search algorithm with robust Hausdorff distance

Using the proposed algorithms, a list of matching images from the database is generated for a query image. The procedure is summarized in Algorithm 1.

4 Experimental results

4.1 Datasets

4.1.1 Oxford Building-11 Dataset

The Oxford Building Dataset [39] comprises 5062 images collected from Flickr by searching for particular Oxford landmarks such as “All Souls Oxford” and “Christ Church Oxford”, as shown in Fig. 6. The collection is manually categorized into 11 different landmarks, and the query set contains 55 images. This is a challenging benchmark for object search due to occlusion and cluttered backgrounds.

Fig. 6 Sample images from the Oxford Building-11 dataset

4.1.2 Corel natural image dataset

The natural landscape images used in this work comprise a total of 4798 images derived from Corel photo CDs, as shown in Fig. 7. The dataset images are randomly partitioned in a 7:3 ratio into a training set and a test set, respectively.

Fig. 7 Sample images from the Corel-48 dataset

4.1.3 PKU landmark dataset

We also test our proposed algorithms on the landmark benchmark from MPEG CDVS requirement subgroup [35], which contains 13,179 scene photos, organized into 198 landmark locations from the Peking University Campus. Sample images of PKU landmark-198 dataset are given in Fig. 8. The dataset images are randomly partitioned in two halves for training set and test set.

Fig. 8 Sample images of PKU landmark-198 dataset

4.2 Experiment settings

For efficient dense-patch SIFT descriptor extraction, we sample overlapping 16×16 pixel patches with a spacing of 8 pixels [27] for all the algorithms on all the datasets. We adopt the standard DAISY setting of radius R=15, radius quantization levels Q=3, angular quantization levels T=8, and histogram quantization levels H=8, as in [33].

All images are resized to 640×480 resolution as a tradeoff between image retrieval efficiency and accuracy. In SVT-based approaches, the branch number is set to 10 and the vocabulary tree depth is set to 6. According to [9], this setting produces satisfactory performance on large-scale image datasets. For matching SVT histograms between the query images and database images, we adopt the intersection kernel [40] due to its efficiency. The multi-class classification method is the “one-against-all” method [41]. The simulation environment is as follows: Ubuntu 14.04, Intel® Core™ i7-3770S CPU @ 3.10 GHz × 8, 8 GB RAM.
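To give a sense of scale for these settings, the following sketch counts the dense patches per image and the number of leaf codewords. It assumes that patches must lie fully inside the 640×480 image with no padding, which the text does not state; the function name is hypothetical.

```python
def dense_grid(width, height, patch=16, step=8):
    """Top-left corners of all patch x patch windows sampled every `step` pixels."""
    xs = range(0, width - patch + 1, step)
    ys = range(0, height - patch + 1, step)
    return [(x, y) for y in ys for x in xs]

patch_corners = dense_grid(640, 480)
n_patches = len(patch_corners)
n_leaves = 10 ** 6  # B^L for branch number B = 10 and depth L = 6

print(n_patches)  # 79 * 59 = 4661 patches per image
print(n_leaves)   # 1000000 leaf codewords
```

Under this assumption each image yields a few thousand descriptors, so its histogram over a million leaves is necessarily very sparse.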

4.3 Preliminary performance evaluation

The proposed image retrieval framework can be enhanced at several stages: feature extraction, image representation, and image matching. We test our proposed approach progressively as follows: baseline SVT based on dense-patch SIFT descriptors, baseline SVT with kernel density to obtain the image signature, baseline SVT incorporated with the robust Hausdorff distance, and baseline SVT incorporated with both kernel density and the robust Hausdorff distance. To evaluate the retrieval performance, we report the retrieval rate in terms of the number of top returned categories. The query image is regarded as correctly recognized when its best matched image corresponds to one of the top n returned categories.

The comparison of the above progressive approaches in terms of different top n matched categories is shown in Tables 1, 2, and 3. We can observe that incorporating the information in either retrieval stage progressively improves the retrieval performance. Integrating the baseline SVT approach with both kernel density and the robust Hausdorff distance is the best choice.

Table 1 Comparison of various SVT approaches on Oxford Building-11 dataset
Table 2 Comparison of various SVT approaches on Corel-48 dataset
Table 3 Comparison of various SVT approaches on PKU dataset

Using the optimal SVT approach above, three image signatures (representations) can be derived: (i) signature (1st) based on histogram of dense-patch SIFT descriptors; (ii) signature (2nd) based on kernel density of dense-patch SIFT descriptors; and (iii) signature (3rd) based on histogram of dense-patch DAISY descriptors. The comparison of the above signatures as well as the fused result using Dempster’s rule of information fusion are shown in Tables 4, 5, and 6 in terms of different top n matched categories.

Table 4 Comparison of various image representations on Oxford Building-11 dataset
Table 5 Comparison of various image representations on Corel-48 dataset
Table 6 Comparison of various image representations on PKU-198 dataset

Figure 11 gives the performance comparison for the PKU-198 dataset in terms of precision. By using the proposed fusion algorithm, the final retrieval results consistently outperform the baseline SVT approach by about 15% in terms of retrieval precision.

We can observe that the 1st signature, i.e., the scheme (c) in Tables 1, 2 and 3, produces moderate performance. The 2nd signature, i.e., the scheme (d), produces better performance due to the incorporation of the kernel density in the signature generation stage. The 3rd signature, i.e., the scheme (d) embedded with the dense-patch DAISY descriptor instead of the dense-patch SIFT descriptor, produces slightly inferior performance to SIFT. Integrating all three signatures using the proposed information fusion algorithm is the best choice, and is superior to the retrieval performance of any individual signature.

The lowest retrieval performance is obtained on the PKU-198 dataset, since the number of categories is large and geometrically different landmarks may have similar appearance. Oxford-11 yields relatively low performance although its number of categories is small, because of challenging occlusions and changes in viewpoint and scale. Corel-48 is a medium-scale dataset and yields moderate performance.

4.4 Objective performance comparisons

To further evaluate the retrieval performance, various image retrieval methods are tested on the three datasets. The image search performance is evaluated in terms of precision. For each query image, the list of top matched images generated by the retrieval system is compared with the ground-truth categories in the dataset. Denote the set of matched images returned by the retrieval system by S and the set of ground-truth matched images from the database by T; then precision is defined as

$$ \mathrm{Precision}=\frac{|S\cap T|}{|S|} $$

where |·| denotes the cardinality of the set.
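
The definition above can be checked with a short sketch. The image identifiers below are hypothetical, and returning 0 for an empty result list is a convention assumed here rather than something stated in the paper.

```python
def precision(retrieved, relevant):
    """Precision = |S ∩ T| / |S| for retrieved set S and ground-truth set T."""
    s = set(retrieved)
    t = set(relevant)
    if not s:
        return 0.0  # assumed convention for an empty result list
    return len(s & t) / len(s)


# Hypothetical top-5 list for one query and its ground-truth matches.
top_n = ["img_03", "img_07", "img_12", "img_21", "img_30"]
ground_truth = {"img_03", "img_12", "img_30", "img_44"}

# Three of the five retrieved images are in the ground-truth set.
print(precision(top_n, ground_truth))
```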

The methods include the baseline BoW approach in [42], the codeword uncertainty (BoW-UNC) method [42], Bosch’s hybrid BoW-pLSA method [43], the baseline SVT approach [1], SVT-Earthmover (the baseline SVT approach combined with Earth Mover’s Distance [44] for image matching), the proposed optimized SVT-Hausdorff approach (with kernel density to obtain the SIFT signature and Hausdorff distance for image matching), and the proposed SVT-fusion approach with DS fusion (from all three image matching score lists). Earth Mover’s Distance evaluates the distance between two multi-dimensional distributions by linear programming [44]. UNC uses kernel density estimation to replace the hard-assignment BoW histogram, which reduces the effect of quantization. UNC-SVM and pLSA-SVM are both state-of-the-art methods that fuse generative and discriminative models; their parameters are optimized following the literature to achieve the highest recognition performance.
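
To make the difference between hard assignment and codeword uncertainty concrete, the following is a minimal sketch under stated assumptions: a Gaussian kernel on Euclidean distance, a toy two-word codebook, and 2-D points standing in for descriptors. The kernel width `sigma` and all data values are illustrative and are not taken from the experiments.

```python
import math


def gaussian_kernel(dist, sigma):
    # Unnormalized Gaussian; the constant cancels in the ratio below.
    return math.exp(-0.5 * (dist / sigma) ** 2)


def hard_histogram(descriptors, codebook):
    """Each descriptor votes only for its nearest codeword."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        dists = [math.dist(d, w) for w in codebook]
        hist[dists.index(min(dists))] += 1.0
    return [h / len(descriptors) for h in hist]


def soft_histogram(descriptors, codebook, sigma):
    """Each descriptor spreads one unit of mass over all codewords."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        weights = [gaussian_kernel(math.dist(d, w), sigma) for w in codebook]
        total = sum(weights)
        for k, weight in enumerate(weights):
            hist[k] += weight / total
    return [h / len(descriptors) for h in hist]


# Toy data (assumed): two codewords and three descriptors between them.
codebook = [(0.0, 0.0), (1.0, 0.0)]
descriptors = [(0.1, 0.0), (0.5, 0.0), (0.9, 0.0)]

hard = hard_histogram(descriptors, codebook)
soft = soft_histogram(descriptors, codebook, sigma=0.5)
print(hard)
print(soft)
```

The middle descriptor is equidistant from both codewords. Hard assignment must give its full vote to one of them, whereas the soft histogram splits it evenly, which is the quantization effect that the kernel-based assignment is meant to reduce.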

Figure 9 gives the performance comparison for the Oxford-11 dataset in terms of precision. From Fig. 9, the proposed image retrieval approach consistently outperforms the other approaches for various numbers of suggested tags. The UNC-SVM and pLSA-SVM approaches are consistently superior to the baseline BoW-SVM approach, but inferior to the baseline SVT approach. This indicates that SVT-based approaches are generally superior to BoW-based approaches. With the proposed fusion algorithm, the final retrieval results consistently outperform the baseline SVT approach by about 13% in terms of retrieval precision.

Fig. 9 Oxford-11 performance in terms of precision

Figure 10 gives the performance comparison for the Corel-48 dataset in terms of precision. With the proposed fusion algorithm, the final retrieval results consistently outperform the baseline SVT approach by about 15% in terms of retrieval precision.

Fig. 10 Corel-48 performance in terms of precision

From Figs. 9, 10, and 11, it is observed that the proposed modified Hausdorff distance improves on the baseline SVT approach by about 5%, and outperforms the conventional Earth Mover’s Distance by about 1%.

Fig. 11 PKU-198 performance in terms of precision

5 Conclusions

This paper presents a new framework for image retrieval, which is flexible enough to incorporate various enhancements based on kernel density, robust Hausdorff distance, and information fusion. These enhancements are carried out in the stages of image signature generation, image matching, and final scoring list, respectively. By embedding various signatures in the proposed framework, the final retrieval performance is superior to that of each individual approach. Experimental results show that the proposed framework outperforms the state-of-the-art content-based image retrieval approaches it was compared with. Future work may include integrating other types of context information into the content analysis.


References

  1. D Nister, H Stewenius, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2. Scalable recognition with a vocabulary tree (IEEE, New York, 2006), pp. 2161–2168.

  2. Y Li, DJ Crandall, DP Huttenlocher, in IEEE 12th International Conference on Computer Vision. Landmark classification in large-scale image collections (IEEE, Kyoto, 2009), pp. 1957–1964.

  3. Y LeCun, L Bottou, Y Bengio, P Haffner, Gradient-based learning applied to document recognition. Proc. IEEE. 86(11), 2278–2324 (1998).

  4. S Haykin, Neural networks and learning machines, vol. 3 (Pearson, NJ, 2009).

  5. A Krizhevsky, I Sutskever, GE Hinton, in Advances in neural information processing systems. Imagenet classification with deep convolutional neural networks (Neural Information Processing Systems (NIPS), Lake Tahoe, 2012), pp. 1097–1105.

  6. AV Singh, Content-based image retrieval using deep learning. Thesis, Rochester Institute of Technology (2015).

  7. V-A Nguyen, MN Do, in IEEE International Conference on Multimedia and Expo (ICME). Deep learning based supervised hashing for efficient image retrieval (IEEE, Seattle, 2016), pp. 1–6.

  8. A Gordo, J Almazan, J Revaud, D Larlus, Deep image retrieval: learning global representations for image search (2016). arXiv preprint arXiv:1604.01325.

  9. B Girod, V Chandrasekhar, DM Chen, et al, Mobile visual search. IEEE Signal Proc. Mag. 28(4), 61–76 (2011).

  10. Z Li, K-H Yap, Content and context boosting for mobile landmark recognition. IEEE Signal Process. Lett. 19(8), 459–462 (2012).

  11. K-H Yap, Z Li, D-J Zhang, Z-K Ng, in Proceedings of the 20th ACM International Conference on Multimedia. Efficient mobile landmark recognition based on saliency-aware scalable vocabulary tree (ACM, Nara, 2012), pp. 1001–1004.

  12. Z Li, K-H Yap, An efficient approach for scene categorization based on discriminative codebook learning in bag-of-words framework. Image Vision Comput. 31(10), 748–755 (2013).

  13. Z Li, K-H Yap, Context-aware discriminative vocabulary tree learning for mobile landmark recognition. Dig. Signal Process. 24, 124–134 (2014).

  14. DP Huttenlocher, G Klanderman, WJ Rucklidge, et al, Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 850–863 (1993).

  15. W Rucklidge, Efficient visual recognition using the Hausdorff distance, vol. 1173 (Springer-Verlag, Secaucus, 1996).

  16. O Jesorsky, KJ Kirchberg, RW Frischholz, in Audio-and video-based biometric person authentication. Robust face detection using the Hausdorff distance (Springer, Berlin, 2001), pp. 90–95.

  17. DG Lowe, in The Proceedings of the Seventh IEEE International Conference on Computer Vision, 2. Object recognition from local scale-invariant features (IEEE, Kerkyra, 1999), pp. 1150–1157.

  18. DG Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004).

  19. TJ Chin, H Goh, JH Lim, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Boosting descriptors condensed from video sequences for place recognition (IEEE, Anchorage, 2008), pp. 1–8.

  20. A Pronobis, B Caputo, in IEEE/RSJ International Conference on Intelligent Robots and Systems. Confidence-based cue integration for visual place recognition (IEEE, San Diego, 2007), pp. 2394–2401.

  21. A Qamra, EY Chang, Scalable landmark recognition using extent. Multimedia Tools Appl. 38(2), 187–208 (2008).

  22. K Mikolajczyk, C Schmid, A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005).

  23. L Van Gool, T Moons, D Ungureanu, Affine/photometric invariants for planar intensity patterns. Eur. Conf. Comput. Vis. 1064, 642–651 (1996).

  24. S Belongie, G Mori, J Malik, Matching with shape contexts. Stat. Anal. Shapes Part Ser. Model. Simul. Sci. Eng. Technol, 81–105 (2006).

  25. L Bo, X Ren, D Fox, in Advances in neural information processing systems. Hierarchical matching pursuit for image classification: architecture and fast algorithms (Neural Information Processing Systems (NIPS), Granada, 2011), pp. 2115–2123.

  26. P Dreuw, P Steingrube, H Hanselmann, H Ney, G Aachen, in British Machine Vision Conference. Surf-face: face recognition under viewpoint consistency constraints (BMVA Press, London, 2009), pp. 1–11.

  27. S Lazebnik, C Schmid, J Ponce, in IEEE Conference on Computer Vision and Pattern Recognition, 2. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories (IEEE, New York, 2006), pp. 2169–2178.

  28. DM Chen, G Baatz, K Koser, et al, in IEEE Conference on Computer Vision and Pattern Recognition. City-scale landmark identification on mobile devices (IEEE, Colorado, 2011), pp. 737–744.

  29. G Baatz, K Köser, D Chen, R Grzeszczuk, M Pollefeys, Leveraging 3d city models for rotation invariant place-of-interest recognition. Int. J. Comput. Vis. 96(3), 315–334 (2012).

  30. G Baatz, K Köser, D Chen, R Grzeszczuk, M Pollefeys, in European Conference on Computer Vision. Handling urban location recognition as a 2d homothetic problem (Springer, Crete, 2010), pp. 266–279.

  31. J Zhang, M Marszalek, S Lazebnik, C Schmid, in IEEE Conference on Computer Vision and Pattern Recognition Workshop. Local features and kernels for classification of texture and object categories: a comprehensive study (IEEE, New York, 2006), pp. 13–13.

  32. E Tola, V Lepetit, P Fua, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). A fast local descriptor for dense matching (IEEE, Anchorage, 2008), pp. 1–8.

  33. E Tola, V Lepetit, P Fua, Daisy: an efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010).

  34. N Khan, B McCane, S Mills, Better than sift? Mach. Vis. Appl. 26(6), 819–836 (2015).

  35. R Ji, LY Duan, J Chen, et al, in 18th IEEE International Conference on Image Processing. Pkubench: a context rich mobile visual search benchmark (IEEE, Brussels, 2011), pp. 2545–2548.

  36. DW Scott, Multivariate density estimation (Wiley & Sons, New York, 1992).

  37. L Greengard, X Sun, A new version of the fast gauss transform. Doc. Math. 3, 575–584 (1998).

  38. K Sentz, S Ferson, Combination of evidence in Dempster-Shafer theory, vol. 4015 (Sandia National Laboratories, Albuquerque, 2002).

  39. J Philbin, A Zisserman, The Oxford Buildings Dataset. (2007). Accessed 18 Feb 2017.

  40. MJ Swain, DH Ballard, Color indexing. Int. J. Comput. Vis. 7(1), 11–32 (1991).

  41. Y Liu, YF Zheng, in IEEE International Joint Conference on Neural Networks, 2. One-against-all multi-class svm classification using reliability measures (IEEE, Montreal, 2005), pp. 849–854.

  42. JC van Gemert, CJ Veenman, et al, Visual word ambiguity. IEEE Trans. Pattern Anal. Mach. Intell. 32(7), 1271–1283 (2010).

  43. A Bosch, A Zisserman, X Muñoz, Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Mach. Intell. 30(4), 712–727 (2008).

  44. Y Rubner, C Tomasi, LJ Guibas, in Sixth International Conference on Computer Vision (ICCV). A metric for distributions with applications to image databases (IEEE, Bombay, 1998), pp. 59–66.


This study was supported by the National Natural Science Foundation of China (61401126), National Natural Science Foundation of China (F011702), Natural Science Foundation of Heilongjiang Province of China (QC2015083), and Heilongjiang Postdoctoral Financial Assistance (LBH-Z14121).

Authors’ contributions

The authors declare equal contribution. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information


Corresponding author

Correspondence to Xiaoming Sun.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Che, C., Yu, X., Sun, X. et al. Image retrieval by information fusion based on scalable vocabulary tree and robust Hausdorff distance. EURASIP J. Adv. Signal Process. 2017, 21 (2017).


  • Image retrieval
  • Hausdorff distance
  • Information fusion
  • Scalable vocabulary tree