- Open Access
Multi-modal image matching based on local frequency information
© Liu et al.; licensee Springer. 2013
- Received: 18 September 2012
- Accepted: 18 December 2012
- Published: 8 January 2013
This paper addresses the problem of matching multi-modal images that share similar physical structures but differ in appearance. To emphasize the common structural information while suppressing the illumination- and sensor-dependent information between multi-modal images, two image representations, Mean Local Phase Angle (MLPA) and Frequency Spread Phase Congruency (FSPC), are proposed using local frequency information in the Log-Gabor wavelet transform space. A confidence-aided similarity (CAS) measure, consisting of a confidence component and a similarity component, is designed to establish the correspondence between multi-modal images. Both representations are invariant to contrast reversal and non-homogeneous illumination variation, and neither requires any derivative or thresholding operation. Because the CAS integrates MLPA and FSPC tightly instead of treating them separately, it gives more weight to the common structures emphasized by FSPC and therefore further reduces the influence of differing sensor properties. We demonstrate the accuracy and robustness of our method by comparing it with popular multi-modal image matching methods. Experimental results show that our method improves on traditional multi-modal image matching and works robustly even in quite challenging situations (e.g. SAR and optical images).
- Multi-modal image
- Image matching
- Image representation
- Local frequency information
- Wavelet transformation
- Similarity measure
Image matching, which aims to find corresponding features or image patches between two images of the same scene, is a fundamental problem in computer vision. It has been widely used in vision navigation, target recognition and tracking, super-resolution, 3-D reconstruction, pattern recognition, medical image processing, etc. In this paper, we focus on matching multi-modal (or multi-sensor) images, i.e. images that differ in the type of visual sensor used to capture them. Several issues make multi-modal image matching a very challenging problem. First, multi-modal images are captured using different visual sensors (e.g. SAR, optical, infrared) at different times. Second, images of different modalities are normally mapped to different intensity values. This makes it difficult to measure similarity from intensity values, since the same content may be represented by different intensities. The problem is further complicated by the fact that various intrinsic and extrinsic sensing conditions may lead to image non-homogeneity. Finally, the disparity between the intensity values of multi-modal images can lead to coincidental local intensity matches between non-corresponding content, which makes it difficult for an algorithm to find the correct solution. Hence, the key issues in multi-modal image matching are illumination-invariant (contrast and brightness) representations, extraction of common structure under varying conditions, and a robust similarity measure.
The existing approaches to multi-modal image matching can generally be classified as feature-based or region-based. Feature-based matching uses extracted features to establish correspondence. Interest points [8, 9], edges, etc. are often used as local features because they can be extracted and matched robustly. For example, the Scale Invariant Feature Transform (SIFT) and the cluster reward algorithm (CRA) have been used to match multi-modal remote sensing images: the SIFT operator is first applied to extract feature points and perform a coarse match, and the CRA similarity measure is then used to achieve accurate correspondence. Yong et al. propose an algorithm for multi-source image matching based on information entropy that jointly considers intensity information and edge direction information. Feature-based methods must satisfy two requirements: (i) features are extracted robustly and (ii) feature correspondences are established reliably. Failure to meet either requirement causes this type of method to fail. In contrast to feature-based methods, region-based methods use the whole image content to establish correspondence. Although most approaches use features for image matching, there is also a significant body of work on region-based matching. One such approach constructs a local phase-coherence representation for multi-modal image matching. This representation has several merits that make it a promising candidate for handling non-homogeneous image contrast: (i) it is relatively insensitive to the level of signal energy; (ii) it depends on the structures in the image and can emphasize edges and ridges at the same time; and (iii) it is well localized in the spatial domain. Irani et al. present an energy-image representation based on directional-derivative filters.
A set of filters, oriented in the horizontal, vertical, and two diagonal directions, is applied to the raw image, and each derivative image is then squared to obtain an “energy” image. Thus, the directional information is preserved in this energy representation. This approach, however, requires explicit directional filters and explicit filtering with Gaussian functions to create a pyramid. In addition, mutual information, which has been widely used and has shown great promise in medical image processing, is often adopted as the similarity measure for multi-modal image matching, since it is insensitive to intensity variation and does not require knowledge of the relationship (joint intensity distribution) between the two modalities [14, 15]. The main merit of region-based methods is their resistance to noise and texture distortions: a relatively large template carries abundant information and thus provides high matching accuracy.
In this paper, we propose a matching framework for multi-modal images based on local frequency information. It exploits the merits of both MLPA and FSPC through the CAS, and can be used to match images captured by similar as well as different types of sensors at different times.
The human visual system can reliably recognize the same object or scene under widely varying conditions. Even if the illumination of a scene changes by several orders of magnitude, our interpretation of it remains largely unchanged. Thus, the main form of invariance required in image matching is invariance to illumination. This is particularly important for multi-modal images, where non-homogeneous contrast and brightness variations occur frequently. In this work, local frequency information is used to construct two image representations for multi-modal image matching, FSPC and MLPA, both of which are dimensionless and invariant to non-homogeneous illumination variation and contrast reversal.
2.1. Log-Gabor function
To preserve phase information, linear-phase filters that are non-orthogonal and come in symmetric/anti-symmetric quadrature pairs should be used. Liu et al. use Gabor filters, which can be tuned to any desired frequency or orientation and offer simultaneous localization in space and frequency, to construct a local-frequency representation for multi-modal images. However, a Gabor function cannot maintain a zero DC component for bandwidths over one octave. Log-Gabor filters have all the merits of Gabor filters and additionally allow filters of arbitrarily large bandwidth to be constructed while still maintaining a zero DC component in the even-symmetric filter. Hence, in this work we use Log-Gabor filters, which have a Gaussian transfer function when viewed on a logarithmic frequency scale, instead of Gabor filters as the basis of our local frequency representations.
where (r, θ) represents the polar coordinates. As the definition shows, the Log-Gabor filter is primarily determined by four parameters, f_0, θ_0, σ_r and σ_θ, where f_0 and θ_0 are the center frequency and orientation angle, and σ_r and σ_θ determine the scale (radial) bandwidth and the angular bandwidth, respectively. The filter bank should be designed so that the transfer function of each filter overlaps sufficiently with those of its neighbors, so that the sum of all the transfer functions covers the spectrum relatively uniformly.
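The transfer function described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the commonly used polar form G(r, θ) = exp(−ln²(r/f_0) / (2σ_r²)) · exp(−(θ − θ_0)² / (2σ_θ²)), and the function name `log_gabor_filter`, the grid size and the parameter values in the usage note are hypothetical.

```python
import numpy as np

def log_gabor_filter(size, f0, theta0, sigma_r, sigma_theta):
    """Frequency-domain Log-Gabor transfer function on a size x size grid.

    Assumed form (not taken from the paper's equation):
    G = exp(-ln(r/f0)^2 / (2 sigma_r^2)) * exp(-(theta-theta0)^2 / (2 sigma_theta^2))
    """
    # Normalised frequency coordinates in FFT layout (DC at index [0, 0]).
    freqs = np.fft.fftfreq(size)
    u, v = np.meshgrid(freqs, freqs)
    r = np.hypot(u, v)
    theta = np.arctan2(v, u)

    # Radial part: a Gaussian on a logarithmic frequency axis.
    r_safe = np.where(r == 0, 1.0, r)      # avoid log(0) at the DC sample
    radial = np.exp(-(np.log(r_safe / f0) ** 2) / (2 * sigma_r ** 2))
    radial[r == 0] = 0.0                   # enforce a zero DC component

    # Angular part: a Gaussian in the wrapped angular distance to theta0.
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(dtheta ** 2) / (2 * sigma_theta ** 2))
    return radial * angular
```

For example, `log_gabor_filter(64, 0.125, 0.0, 0.55, 0.4)` builds one member of a filter bank. Multiplying the 2-D FFT of an image by this array and taking the inverse FFT gives a complex response whose real and imaginary parts play the role of the even- and odd-symmetric filter outputs.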
2.2. Local frequency representations
For multi-modal images, the signals are correlated primarily in their high-frequency information, and the correlation tends to degrade as the high-frequency information is reduced. This is because high-frequency information (e.g. edges, contours, corners, junctions) normally corresponds to physical structure that is common to images of different modalities. Low-frequency information, on the other hand, depends heavily on the illumination and on the photometric and physical imaging properties of the sensors, which differ substantially between multi-modal images. To capture the common physical structure, high-pass filters working in the spatial domain (e.g. Sobel, Prewitt, Laplacian) are a natural choice [10, 13]. These methods are straightforward and fast to compute. However, they normally depend on intensity gradient information, which is closely tied to local image contrast, so non-homogeneous contrast variation may degrade the performance of the algorithm.
where E(x) denotes the local energy, i.e. the magnitude of the vector sum of the local frequency components. As the definition shows, phase congruency is the ratio of the energy E(x) to the total path length taken by the local frequency components in reaching the end point. If all the local frequency components are in phase, all the response vectors are aligned and the phase congruency, PC_1, reaches its maximum of 1. If there is no phase coherence, PC_1 falls to its minimum of 0. Phase congruency is independent of the overall magnitude of the signal, which makes it invariant to variations in image brightness and contrast.
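The ratio described here, PC_1(x) = E(x) / Σ_n A_n(x), can be illustrated on a 1-D signal. The following is a rough sketch under stated assumptions, not the authors' code: the filter bank uses 1-D Log-Gabor filters at hypothetical wavelengths, `sigma_r` is an assumed bandwidth value, and the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np

def phase_congruency_1d(signal, wavelengths=(4, 8, 16, 32), sigma_r=0.55, eps=1e-8):
    """PC_1(x) = E(x) / sum_n A_n(x) for a 1-D signal (illustrative sketch)."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    freqs = np.fft.fftfreq(n)
    spectrum = np.fft.fft(signal)
    pos = freqs > 0

    sum_resp = np.zeros(n, dtype=complex)  # vector sum of the scale responses
    sum_amp = np.zeros(n)                  # total path length, sum of A_n(x)
    for wl in wavelengths:
        f0 = 1.0 / wl
        g = np.zeros(n)                    # zero at DC and at negative frequencies
        g[pos] = np.exp(-(np.log(freqs[pos] / f0) ** 2) / (2 * sigma_r ** 2))
        # One-sided filtering: real part = even response, imaginary part = odd response.
        resp = np.fft.ifft(spectrum * 2 * g)
        sum_resp += resp
        sum_amp += np.abs(resp)

    energy = np.abs(sum_resp)              # E(x): magnitude of the vector sum
    return energy / (sum_amp + eps)
```

Applied to a step edge, this sketch gives a high value at the edge, and the value there is essentially unchanged when the signal is rescaled, offset or negated, which matches the brightness and contrast invariance stated above.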
Weighting by frequency spread has the benefit of suppressing ill-conditioned responses that have a low frequency spread, and of improving the localization accuracy of features, especially smoothed features whose responses are normally uniform. In addition, noise resistance is improved to some extent, since the responses to noise are normally skewed towards the high-frequency end and therefore have relatively narrow frequency spectra.
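One common way to realise such a weighting is the sigmoid used in Kovesi's phase congruency work, which is sketched below. It is an assumption that the weighting here takes this form: the spread measure, the function name `frequency_spread_weight`, and the default `cutoff` and `gain` values are illustrative and are not taken from this paper.

```python
import numpy as np

def frequency_spread_weight(amplitudes, cutoff=0.5, gain=10.0, eps=1e-8):
    """Sigmoid weight that penalises responses with a narrow frequency spread.

    amplitudes: array of shape (n_scales, ...) holding A_n(x) for each scale.
    cutoff, gain: assumed illustrative values, not the paper's settings.
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    n_scales = amplitudes.shape[0]
    # Spread is close to 1 when all scales respond equally, 1/n_scales when only one does.
    spread = amplitudes.sum(axis=0) / (n_scales * (amplitudes.max(axis=0) + eps))
    return 1.0 / (1.0 + np.exp(gain * (cutoff - spread)))
```

With four scales, a location where every scale responds equally receives a weight close to 1, while a location where only one scale responds receives a weight below 0.1.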
Having obtained the local frequency representations, we then use them to perform matching. As their definitions show, MLPA primarily represents the phase information of the local frequency, whereas FSPC mainly uses the amplitude information. Because the information they carry is largely independent, MLPA and FSPC can complement each other to some extent. Hence, a suitable fusion scheme that makes the best use of the merits of both can achieve better matching performance. For example, using MLPA alone may introduce errors, particularly in texture-less image regions, where FSPC normally takes quite small values because significant features are lacking. In addition, it may be difficult to distinguish between two search windows that have similar MLPA but different FSPC.
where d = −|MLPA_1 − MLPA_2| and c = (FSPC_1 + FSPC_2)/2. Here d is the similarity component, which reflects how well the two signals resemble each other, and c is the confidence component, which reflects the confidence that a match is correct.
An MLPA value associated with a low FSPC is normally less reliable than one associated with a high FSPC. Therefore, it is important to give more confidence to higher FSPC values. In fact, the confidence component is the mean of the two FSPCs, so the confidence is closely tied to the significance of the signals and takes a larger value when both signals are significant. Hence, CAS_0 normally takes a relatively large value when the two pixels are both similar and significant, and a relatively small value when they are not.
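The two per-pixel components can be written down directly from their descriptions. The sketch below computes only the similarity component d and the confidence component c; how the two are combined into CAS_0 is not reproduced, and the function name `cas_components` is hypothetical.

```python
import numpy as np

def cas_components(mlpa1, mlpa2, fspc1, fspc2):
    """Per-pixel similarity and confidence components (illustrative sketch)."""
    mlpa1, mlpa2 = np.asarray(mlpa1, float), np.asarray(mlpa2, float)
    fspc1, fspc2 = np.asarray(fspc1, float), np.asarray(fspc2, float)
    d = -np.abs(mlpa1 - mlpa2)   # similarity: 0 for identical MLPA, negative otherwise
    c = 0.5 * (fspc1 + fspc2)    # confidence: mean of the two FSPC values
    return d, c
```

A pixel pair with identical MLPA and high FSPC in both images therefore has d = 0 and a large c, whereas a pair in a texture-less region has a small c regardless of d.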
This measure returns 0 when the matching windows are identical. The denominator, C, is in fact related to the confidence component. For the same value of the similarity D, the definition of CAS yields a larger similarity when the associated confidence components are higher. It is apparent that CAS is invariant to global linear illumination transformations: I → αI + b.
The values for the method parameters used in the experiments
- Number of wavelet scales
- Number of wavelet orientations
- Wavelength of smallest scale filter
- Scaling factor between successive filters
- Cut-off value “c”
- Gain factor “λ”
Comparisons of accuracy rates obtained from different methods
4.1. Illumination invariant property
where f and g denote the raw and synthetic images, respectively. From Eq. 20, we can see that the value of NCC is closely related to the degree of non-homogeneous illumination variation. If there is no non-homogeneous illumination variation, NCC takes its maximum value of 1. Figure 3 shows the results of the numerical evaluation for the gray-level images of Figure 4, the MLPAs of Figure 2, and the FSPCs of Figure 5. As can be seen, the NCC values of MLPA and FSPC remain almost invariant to the non-homogeneous illumination variation, whereas the NCC values of the gray-scale images fluctuate with the degree of non-homogeneous illumination variation. Homogeneous illumination variation, which can be considered a special case of non-homogeneous illumination variation, is not validated separately in this work. From the visual and numerical validation, we conclude that both MLPA and FSPC are largely invariant to non-homogeneous illumination variation.
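The numerical check described here can be sketched as follows. The code assumes the standard zero-mean form of normalized cross-correlation, which may differ in detail from Eq. 20, and the function name `ncc` is hypothetical.

```python
import numpy as np

def ncc(f, g):
    """Zero-mean normalized cross-correlation of two equally sized images."""
    f = np.asarray(f, dtype=float).ravel()
    g = np.asarray(g, dtype=float).ravel()
    fz = f - f.mean()
    gz = g - g.mean()
    denom = np.sqrt((fz ** 2).sum() * (gz ** 2).sum())
    if denom == 0:
        return 0.0               # undefined for constant images; return 0 by convention
    return float((fz * gz).sum() / denom)
```

Under this definition, a global linear change g = αf + b with α > 0 still gives an NCC of 1, whereas a spatially varying gain applied to the raw gray levels lowers it. This is the behaviour that the comparison between the gray-level images and their MLPA and FSPC representations is meant to expose.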
4.2. Evaluation using synthetic images
4.3. Matching accuracy evaluation using real images
It should be noted that the proposed method performs better than MI. The underlying assumption of MI is that the statistical relationship between the matching images is homogeneous over the whole image domain. This is normally true when the intensity mapping between the matching images is global and highly correlated, or when structures with different intensities in one image have similar intensities in the other, e.g. bone and background in CT and MR. However, as discussed above, the statistical relationships between the intensities of multi-modal image pairs are normally neither global nor homogeneous, which is quite different from the medical imaging case. Therefore, MI may not be sufficient for matching multi-modal images. In addition, the absence of local spatial information in MI also weakens the matching robustness to some extent.
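The global assumption discussed above is visible in how MI is usually estimated: a single joint histogram is pooled over the whole window, with no record of where each intensity pair occurred. The following is a minimal histogram-based sketch, not the MI implementation used in the experiments; the function name `mutual_information` and the bin count are assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information (in nats) between two images."""
    # One joint histogram over all pixel pairs: spatial position is discarded.
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

Because the estimate depends only on the pooled intensity statistics, it stays high under a global remapping such as contrast reversal, but it has no way of expressing a relationship that changes from one image region to another.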
Symmetries are a potentially robust and stable feature of many man-made and natural scenes, which makes them suitable for representing multi-modal images. LSS, which is designed to score local symmetries, performs almost comparably to PC and works reasonably well in our experiments, although its primary goal is to extract local features from images of architectural scenes.
From the evaluation using synthetic and real images, we conclude that, because it takes into account noise resistance, illumination adaptability, and common structure extraction and weighting, the proposed method achieves a higher accuracy rate and better matching confidence than the conventional methods on the test images used.
To achieve robust multi-modal image matching, we first present two image representations, FSPC and MLPA, based on the Log-Gabor wavelet transform, and then design the CAS, which combines confidence and similarity using the information in FSPC and MLPA to find the correspondence. The proposed method has three main merits: (1) both MLPA and FSPC are invariant to the non-homogeneous illumination (contrast, brightness) variation and contrast reversal that frequently occur between multi-modal images; (2) FSPC effectively captures the common structural information of the scene while suppressing non-common, sensor-dependent properties; (3) acting as the confidence factor, FSPC allows the CAS to softly assign more weight to the structural information it extracts. In addition, the proposed method is threshold-free and can therefore retain as much image detail as possible to resist noise and scene distortions between images. Experiments using numerous real and synthetic images demonstrate that our method can match multi-modal images robustly. Through comparison experiments, we also demonstrate the advantage of our method over conventional methods. In the future, we plan to introduce geometric transformations into our matching framework and extend our method to image alignment.
This work was partly supported by Oulu University, Finland. The authors would like to thank Prof. Janne Heikkila, Dr. Jie Chen and Guoying Zhao for their contributions. The authors also want to express their gratitude to the anonymous reviewers whose thoughtful comments and suggestions improved the quality of the article.
- Conte G, Doherty P: Vision-based unmanned aerial vehicle navigation using Geo-referenced information. EURASIP J. Adv. Sig. Process. 2009, 2009: 1-18.
- Kalal Z, Mikolajczyk K, Matas J: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intel. 2010, 6(1):1-14.
- Vandewalle P, Susstrunk S, Vetterli M: A frequency domain approach to registration of aliased images with application to super-resolution. EURASIP J. Adv. Sig. Process. 2006, 2006: 1-14.
- Brown M, Lowe D: Unsupervised 3D object recognition and reconstruction in unordered datasets. Proc. Int. Conf. 3-D Digit. Imag. Model. 2005, 56-63.
- Yingzi D, Craig B, Zhi Z: Scale invariant Gabor descriptor-based noncooperative iris recognition. EURASIP J. Adv. Sig. Process. 2010, 2010: 1-13.
- Yang Y, Dong Sun P, Shuying H, Nini R: Medical image fusion via an effective wavelet-based approach. EURASIP J. Adv. Sig. Process. 2010, 2010: 1-13.
- Wong A, Orchard J: Robust multi-modal registration using local phase-coherence representations. J. Sign. Process. Syst. 2009, 54: 89-100. 10.1007/s11265-008-0202-x
- Yingdan WU, Yang MING: A multi-sensor remote sensing image matching method based on SIFT operator and CRA similarity measure. Proceedings of 2011 International Conference on Intelligence Science and Information Engineering 2011, 115-118.
- Sasa W, Zhenbing Z, Ping Y, Zejing G: Infrared and visible image matching algorithm based on NSCT and DAISY. Proceedings of 2011 4th International Congress on Image and Signal Processing 2011, 4: 2072-2075.
- Yong S, Jae H, Jong B: Multi-sensor image registration based on intensity and edge orientation information. Pattern Recognition 2008, 41: 3356-3365. 10.1016/j.patcog.2008.04.017
- Inglada J: Similarity measures for multi-sensor remote sensing images. Proceedings of Geoscience and Remote Sensing Symposium, Toulouse 2001, 5236: 182-189.
- Kovesi P: Image correlation from local frequency information. Proceedings of the Australian Pattern Recognition Society Conference 1995, 1995: 336-341.
- Irani PAM: Robust multi-sensor image alignment. Proceedings of the 6th International Conference on Computer Vision 1998, 959-966.
- Josien PW, Pluim JB, Antoine M, Viergever MA: Mutual-information-based registration of medical images: a survey. IEEE Trans. Med. Imag. 2003, 22(8):986-1004. 10.1109/TMI.2003.815867
- Estévez PA, Tesmer M, Perez CA, Zurada JM: Normalized mutual information feature selection. IEEE Trans. Neural Netw. 2009, 20(2):189-201.
- Liu J, Vemuri BC, Bova F: Efficient multi-modal image registration using local-frequency maps. Mach. Vis. Appl. 2002, 13: 149-163. 10.1007/s001380100072
- Morlet J, Arens G, Fourgeau E, Giard D: Wave propagation and sampling theory-part II: sampling theory and complex waves. Geophysics 1982, 47(2):222-236. 10.1190/1.1441329
- Morrone MC, Ross JR, Burr DC, Owens RA: Mach bands are phase dependent. Nature 1986, 324(6094):250-253. 10.1038/324250a0
- Morrone MC, Owens RA: Feature detection from local energy. Pattern. Recognit. Lett. 1987, 6: 303-313. 10.1016/0167-8655(87)90013-4
- Kovesi P: Phase congruency: a low-level image invariant. Psychol. Res. 2000, 64: 136-148. 10.1007/s004260000024
- Elbakary MI, Sundareshan MK: Multi-modal image registration using local frequency representation and computer-aided design (CAD) models. Image Vis. Comput. 2007, 25: 663-670. 10.1016/j.imavis.2006.05.009
- Liu Z, Laganière R: Phase congruence measurement for image similarity assessment. Pattern. Recognit. Lett. 2007, 28: 166-172. 10.1016/j.patrec.2006.06.019
- Daniel Cabrini H, Noah S: Image matching using local symmetry features. Proc. CVPR 2012, 206-213.
- Viola P, Wells WM: Alignment by maximization of mutual information. Proc. ICCV 1995, 16-23.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.