# Multi-modal image matching based on local frequency information

- Xiaochun Liu^{1, 2} (Email author)
- Zhihui Lei^{1, 2}
- Qifeng Yu^{1, 2}
- Xiaohu Zhang^{1, 2}
- Yang Shang^{1, 2}
- Wang Hou^{1, 2}

**2013**:3

https://doi.org/10.1186/1687-6180-2013-3

© Liu et al.; licensee Springer. 2013

**Received:** 18 September 2012

**Accepted:** 18 December 2012

**Published:** 8 January 2013

## Abstract

This paper addresses the problem of matching multi-modal images that share similar physical structures but differ in appearance. To emphasize the common structural information while suppressing the illumination- and sensor-dependent information between multi-modal images, two image representations, namely Mean Local Phase Angle (MLPA) and Frequency Spread Phase Congruency (FSPC), are proposed using local frequency information in the Log-Gabor wavelet transformation space. A confidence-aided similarity (CAS) measure, which consists of a confidence component and a similarity component, is designed to establish the correspondence between multi-modal images. Both representations are invariant to contrast reversal and non-homogeneous illumination variation, and neither requires any derivative or thresholding operation. Because the CAS integrates MLPA with FSPC tightly instead of treating them separately, it gives more weight to the common structures emphasized by FSPC and therefore further reduces the influence of different sensor properties. We demonstrate the accuracy and robustness of our method by comparing it with popular methods for multi-modal image matching. Experimental results show that our method improves on traditional multi-modal image matching and works robustly even in quite challenging situations (e.g. SAR and optical images).

### Keywords

Multi-modal image; Image matching; Image representation; Local frequency information; Wavelet transformation; Similarity measure

## 1. Introduction

Image matching, which aims to find corresponding features or image patches between two images of the same scene, is a fundamental problem in computer vision. It has been widely used in vision navigation [1], target recognition and tracking [2], super-resolution [3], 3-D reconstruction [4], pattern recognition [5], medical image processing [6], etc. In this paper, we focus on matching multi-modal (or multi-sensor) images, i.e. images that differ in the type of visual sensor used to capture them. Several issues make multi-modal image matching a very challenging problem [7]. First, multi-modal images are captured using different visual sensors (e.g. SAR, optical, infrared, etc.) at different times. Second, images with different modalities are normally mapped to different intensity values. This makes it difficult to measure similarity based on intensity values, since the same content may be represented by different intensities. The problem is further complicated by the fact that various intrinsic and extrinsic sensing conditions may lead to image non-homogeneity. Finally, the disparity between the intensity values of multi-modal images can lead to coincidental local intensity matches between non-corresponding content, which makes it difficult for an algorithm to find the correct solution. Hence, multi-modal image matching centers on three things: illumination-invariant (contrast and brightness) representations, extraction of common structure under varying conditions, and a robust similarity measure.

The existing approaches for multi-modal image matching can generally be classified as feature-based or region-based. Feature-based matching uses extracted features to establish correspondence. Interest points [8, 9], edges [10], etc. are often used as local features because of their robustness in extraction and matching. In [8], the Scale Invariant Feature Transform (SIFT) and the cluster reward algorithm (CRA) [11] are used to match multi-modal remote sensing images. The SIFT operator is first adopted to extract feature points and perform a coarse match, and the CRA similarity measure is then used to achieve accurate correspondence. In [10], Yong *et al.* propose an algorithm for multi-source image matching based on information entropy, which jointly considers the intensity information and the edge direction information. Feature-based methods must satisfy two requirements: (i) features are extracted robustly and (ii) feature correspondences are established reliably. Failure to meet either of them will cause this type of method to fail.

In contrast to feature-based methods, region-based methods make use of the whole image content to establish correspondence. While most approaches use features for image matching, there is also a significant amount of work on region-based matching. In [12], a local phase-coherence representation is constructed for multi-modal image matching. This representation has several merits that make it a promising candidate for handling non-homogeneous image contrast: (i) it is relatively insensitive to the level of signal energy; (ii) it depends on the structures in the image and can emphasize edges and ridges at the same time; and (iii) it is well localized in the spatial domain. In [13], M. Irani *et al.* present an energy-image representation based on directional-derivative filters. A set of filters, oriented in the horizontal, vertical, and two diagonal directions, is applied to the raw image, and the derivative image is then squared to obtain an "energy" image. The directional information is thus preserved in this energy representation. This approach, however, requires explicit directional filters and explicit filtering with Gaussian functions to create a pyramid. In addition, mutual information, which is commonly used and has shown great promise in medical image processing, is often adopted as the similarity measure for multi-modal image matching, since it is insensitive to intensity variation and does not require knowledge of the relationship (joint intensity distribution) between the two modalities [14, 15]. The main merit of region-based methods is their resistance to noise and texture distortions: a relatively large template supplies abundant information and thus yields high matching accuracy.

In this paper, we propose a matching framework for multi-modal images based on local frequency information. It takes advantage of the merits of both MLPA and FSPC through the CAS, and can be used to match images captured by similar as well as different types of sensors at different times.

## 2. Image representations via local frequency information

The human visual system can reliably recognize the same object or scene under widely varying conditions. Even if the illumination of a scene changes by several orders of magnitude, our interpretation of it remains largely unchanged. Thus, the main form of invariance required in image matching is invariance to illumination. This is particularly important for multi-modal images, where non-homogeneous contrast and brightness variations frequently occur. In this work, local frequency information is used to construct two image representations for multi-modal image matching, FSPC and MLPA, which are both dimensionless and invariant to non-homogeneous illumination variation and contrast reversal.

### 2.1. Log-Gabor function

To preserve phase information, linear-phase filters that are nonorthogonal and in symmetric/anti-symmetric quadrature pairs should be used. In [16], J. Liu *et al.* use Gabor filters that can be tuned to any desired frequency or orientation and offer simultaneous localization of spatial and frequency information to construct local-frequency representation for multi-modal images. However, Gabor function cannot maintain a zero DC component for bandwidths over one octave. Log-Gabor filters have all the merits of Gabor filters and additionally allow constructing arbitrarily large bandwidth filters while still maintaining a zero DC component in the even-symmetric filter. Hence, in this work we prefer to use Log-Gabor filters that have a Gaussian transfer function when viewed on the logarithmic frequency scale, instead of Gabor filters, as the basis of our local frequency creation [17].

where $(r, \theta)$ represents the polar coordinates. As we can see from the definition formulas, the Log-Gabor filter is primarily determined by four parameters: $f_0$, $\theta_0$, $\sigma_r$ and $\sigma_\theta$, where $f_0$ and $\theta_0$ correspond to the center frequency and orientation angle, and $\sigma_r$ and $\sigma_\theta$ determine the scale bandwidth and angular bandwidth, respectively. The filter bank needs to make the transfer function of each filter overlap sufficiently with its neighbors so that the sum of all the transfer functions forms a relatively uniform coverage of the spectrum.
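To make the role of the four parameters concrete, the following is a minimal sketch of a polar Log-Gabor transfer function. The functional form (a Gaussian in log-frequency multiplied by a Gaussian in orientation) is a commonly used one and is an assumption here, not necessarily the exact definition used in this work. The helper `bank_parameters`, the mapping from wavelength to center frequency ($f_0 = 1/\text{wavelength}$) and the even spacing of orientations over $[0, \pi)$ are likewise hypothetical choices; the default values follow the parameter table in Section 4.

```python
import math

def log_gabor(r, theta, f0, theta0, sigma_r, sigma_theta):
    """Value of a polar Log-Gabor transfer function at radius r, angle theta.

    Assumed form (not taken from the paper): Gaussian on the log-frequency
    axis times a Gaussian on the orientation axis.
    """
    if r <= 0.0:
        return 0.0  # zero DC component by construction
    radial = math.exp(-(math.log(r / f0) ** 2) / (2.0 * sigma_r ** 2))
    # Wrap the angular difference into [-pi, pi] before applying the Gaussian.
    d_theta = math.atan2(math.sin(theta - theta0), math.cos(theta - theta0))
    angular = math.exp(-(d_theta ** 2) / (2.0 * sigma_theta ** 2))
    return radial * angular

def bank_parameters(n_scales=4, n_orients=9, min_wavelength=3.0, mult=2.1):
    """Center frequencies and orientations of a filter bank (hypothetical helper)."""
    freqs = [1.0 / (min_wavelength * mult ** s) for s in range(n_scales)]
    orients = [k * math.pi / n_orients for k in range(n_orients)]
    return freqs, orients
```

Each filter peaks at its own $(f_0, \theta_0)$ and falls off smoothly around it; choosing $\sigma_r$ and $\sigma_\theta$ large enough relative to the spacing of the center frequencies and orientations is what gives the overlap between neighboring filters described above.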

### 2.2. Local frequency representations

Let $I$ denote the signal, and let $LG_{n,\theta}^{e}$ and $LG_{n,\theta}^{o}$ denote the even-symmetric and odd-symmetric components of the Log-Gabor function at scale $n$ and orientation $\theta$. The response vector formed by the responses of each quadrature pair of filters can be expressed as

$$\left[e_{n,\theta}(x),\; o_{n,\theta}(x)\right] = \left[I(x) * LG_{n,\theta}^{e},\; I(x) * LG_{n,\theta}^{o}\right]$$

where $e_{n,\theta}(x)$ and $o_{n,\theta}(x)$ can be regarded as the real and imaginary parts of a complex-valued frequency component. The amplitude of the response vector at scale $n$ and orientation $\theta$ is given by

$$A_{n,\theta}(x) = \sqrt{e_{n,\theta}(x)^{2} + o_{n,\theta}(x)^{2}}$$

At each point $x$ of a signal, we have an array of these response vectors, one for each scale and orientation of filter. The response vectors form the basis of the proposed representations. The MLPA can be calculated as follows:

$$MLPA(x) = \begin{cases} \operatorname{atan2}\left(F(x), H(x)\right), & \operatorname{atan2}\left(F(x), H(x)\right) \ge 0 \\ \operatorname{atan2}\left(F(x), H(x)\right) + \pi, & \operatorname{atan2}\left(F(x), H(x)\right) < 0 \end{cases}$$

where $F(x)$ and $H(x)$ can be calculated by summing the even and odd filter convolutions:

$$F(x) = \sum_{\theta}\sum_{n} e_{n,\theta}(x), \qquad H(x) = \sum_{\theta}\sum_{n} o_{n,\theta}(x)$$

This definition maps a vector $[F(x), H(x)]$ located in the third/fourth quadrant (where $\operatorname{atan2}(F(x), H(x)) < 0$) to the first/second quadrant (where $\operatorname{atan2}(F(x), H(x)) \ge 0$). Each value of MLPA is independent of the overall energy of the signal and is a measure of the mean local phase angle. Hence, all MLPA maps have the same units and are invariant to both scale and offset illumination changes (e.g. Figures 2 and 3). The main goal of MLPA is to eliminate the variation of intensity values between corresponding pixels of a multi-modal image pair by using the phase information of local frequency.

A sophisticated matching algorithm normally needs an outlier rejection mechanism, since in many situations there are more "outliers" (non-common scene content) than "inliers" (common scene content) between multi-modal images. MLPA alone, however, cannot identify the inliers or eliminate the influence of the outliers. Hence, in this work the FSPC, which aims to capture the common scene information while suppressing the illumination- and sensor-dependent information, is developed using the amplitude information of local frequency.
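The quadrant mapping described above can be sketched for a single pixel as follows. This is an illustration under stated assumptions: the function name `mlpa` is hypothetical, the filter responses are passed as two flat sequences over all scales and orientations, and the third/fourth-quadrant angles are assumed to be shifted by $\pi$, which is the mapping that makes the result insensitive to contrast reversal.

```python
import math

def mlpa(even, odd):
    """Mean local phase angle at one pixel (illustrative sketch).

    even, odd: even- and odd-symmetric filter responses at the pixel,
    flattened over all scales and orientations (equal lengths).
    """
    F = sum(even)  # sum of even-symmetric responses
    H = sum(odd)   # sum of odd-symmetric responses
    angle = math.atan2(F, H)
    # Assumed mapping: shift negative angles by pi so the result lies in
    # [0, pi]; (F, H) and (-F, -H) then give the same value.
    return angle + math.pi if angle < 0.0 else angle
```

Because only the direction of $[F(x), H(x)]$ enters the result, multiplying all responses by a positive gain leaves the value unchanged, and negating them (contrast reversal) is absorbed by the shift.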

For multi-modal images, the signals are correlated primarily in their high-frequency information, and the correlation tends to degrade as the high-frequency information is reduced [13]. This is because high-frequency information (e.g. edges, contours, corners, junctions, etc.) normally corresponds to the physical structure that is common to images with different modalities. Low-frequency information, on the other hand, depends heavily on the illumination and on the photometric and physical imaging properties of the sensors, which differ substantially between multi-modal images. To capture the common physical structure, high-pass filters working in the spatial domain (e.g. Sobel, Prewitt, Laplacian, etc.) are commonly adopted [10, 13]. These methods are straightforward and fast to compute. However, they normally depend on intensity gradient information, which is closely tied to local image contrast, so non-homogeneous contrast variation may degrade their performance.

The phase congruency at a point $x$ in a signal, proposed by Morrone *et al.* in [18, 19], can be expressed as

$$PC_{1}(x) = \frac{E(x)}{\sum_{\theta}\sum_{n} A_{n,\theta}(x)}$$

where $E(x)$ denotes the energy, which is the magnitude of a vector sum. As the definition shows, phase congruency is the ratio of the energy $E(x)$ to the overall path length taken by the local frequency components in reaching the end point. If all the local frequency components are in phase, all the response vectors are aligned and the phase congruency, $PC_1$, reaches its maximum of 1. If there is no phase coherence, $PC_1$ falls to its minimum of 0. Phase congruency is independent of the overall magnitude of the signal, which makes it invariant to variations in image brightness and contrast.
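The ratio described above can be illustrated with a small per-pixel sketch. The function name `phase_congruency` and the small constant `eps` (added to the denominator to avoid division by zero) are assumptions made for this example.

```python
import math

def phase_congruency(even, odd, eps=1e-4):
    """Energy of the summed response vector over the total path length."""
    F, H = sum(even), sum(odd)
    energy = math.hypot(F, H)  # E(x): magnitude of the vector sum
    # Total length of the individual response vectors (sum of amplitudes).
    path = sum(math.hypot(e, o) for e, o in zip(even, odd))
    return energy / (path + eps)
```

With all response vectors pointing in the same direction the vector sum is as long as the path, so the ratio approaches 1; with vectors that cancel each other the energy, and hence the ratio, drops to 0.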

Here, $N$ denotes the total number of filters, and $\varepsilon$ is used to avoid division by zero and to discount the result when both $\sum_{\theta}\sum_{n} A_{n,\theta}(x)$ and $\sqrt{\sum_{\theta}\sum_{n} A_{n,\theta}^{2}(x)}$ are very small. The value of the spread function, $s(x)$, varies between 0 and 1. If the distribution of filter responses is uniform over the whole spectrum, $s(x)$ reaches its maximum value of 1. The frequency spread weighting function can be constructed by applying a hyperbolic tangent function to the filter response spread value,

where $c$ is the "cut-off" value, below which the value of phase congruency will be penalized, and $\lambda$ is a gain factor that controls the sharpness of the cut-off. Thus, the definition of FSPC can be given as

Weighting by frequency spread has the benefit of suppressing ill-conditioned responses that have a low frequency spread, as well as improving the localization accuracy of features, especially smoothed features whose responses are normally uniform [20]. In addition, noise resistance is improved to some extent, since the responses of noise are normally skewed towards the high-frequency end and therefore have relatively narrow frequency spectra.
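The following sketch shows one plausible way the spread weighting could be put together. Every formula in it is an assumption chosen to match the verbal description (a spread value in $[0, 1]$ that equals 1 for uniform responses, a hyperbolic tangent weighting with cut-off $c$ and gain $\lambda$, and a weighted phase congruency); it should not be read as the exact definitions used in this work. The function names and `eps` are hypothetical, and the defaults 0.55 and 10 are the cut-off and gain from the parameter table in Section 4.

```python
import math

def frequency_spread(amplitudes, eps=1e-4):
    """Assumed spread measure: sum(A) / (sqrt(N) * (sqrt(sum(A^2)) + eps))."""
    n = len(amplitudes)
    total = sum(amplitudes)
    rms_len = math.sqrt(sum(a * a for a in amplitudes))
    return total / (math.sqrt(n) * (rms_len + eps))

def spread_weight(s, cutoff=0.55, gain=10.0):
    """Assumed tanh weighting: near 0 below the cut-off, near 1 above it."""
    return 0.5 * (1.0 + math.tanh(gain * (s - cutoff)))

def fspc(even, odd, eps=1e-4):
    """Assumed FSPC: spread weight multiplied by phase congruency."""
    amps = [math.hypot(e, o) for e, o in zip(even, odd)]
    energy = math.hypot(sum(even), sum(odd))
    pc1 = energy / (sum(amps) + eps)
    return spread_weight(frequency_spread(amps, eps)) * pc1
```

Under these assumptions, a pixel that responds in only one of four filters has a spread just below 0.5 and is strongly down-weighted, whereas a pixel with uniform, in-phase responses keeps a value close to 1.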

## 3. Matching using local frequency representations

Having obtained the local frequency representations, we then use them to perform matching. As their definitions show, MLPA primarily represents the phase information of local frequency, whereas FSPC mainly uses the amplitude information. Because these two sources of information are largely independent, MLPA and FSPC can complement each other to some extent. Hence, a proper fusion scheme that exploits the merits of both can achieve better matching performance. For example, using MLPA alone may induce errors, particularly in texture-less image regions, where FSPC normally takes quite small values owing to the lack of significant features. In addition, it may be difficult to distinguish between two search windows that have similar MLPA but different FSPC.

Let $FSPC_1$, $FSPC_2$, $MLPA_1$ and $MLPA_2$ denote a pair of FSPC values and a pair of MLPA values to be compared, respectively. The definition of CAS for a single signal can be expressed as

where $d = -\left|MLPA_1 - MLPA_2\right|$ and $c = \frac{1}{2}\left(FSPC_1 + FSPC_2\right)$. Here $d$ is the similarity component, which reflects how well the two signals resemble each other, and $c$ is the confidence component, which reflects the confidence that a match is correct.

An MLPA value with low FSPC is normally less reliable than one with high FSPC. Therefore, it is important to give more confidence to higher FSPC values. The confidence component is the mean of the two FSPC values, so it is closely related to the significance of the signals and takes a larger value when both signals are significant. Hence, $CAS_0$ normally takes a relatively large value when the two pixels are both similar and significant, and a relatively small value when they are not.

For two matching windows of size $(2n+1)\times(2m+1)$ centered at $(x, y)$ and $(u, v)$, the CAS is given by

$CAS_1$ can be normalized so that its maximum value is equal to 1:

This measure returns 0 when the matching windows are identical. The denominator, $C$, is related to the confidence component. For the same value of the similarity $D$, the definition of CAS yields a larger similarity when the associated confidence components are high. CAS is also invariant to global linear illumination transformations of the form $I \to \alpha I + b$.

## 4. Implementation and experiments

**The values for the method parameters used in the experiments**

Number of wavelet scales | Number of wavelet orientations | Wavelength of smallest scale filter | Scaling factor between successive filters | Cut-off value “c” | Gain factor “λ” |
---|---|---|---|---|---|
4 | 9 | 3 | 2.1 | 0.55 | 10 |

**Comparisons of accuracy rates obtained from different methods**

### 4.1. Illumination invariant property

where $f$ and $g$ denote the raw and synthetic images, respectively. From Eq. 20, we can see that the value of NCC is closely related to the degree of non-homogeneous illumination variation. If there is no non-homogeneous illumination variation, NCC takes its maximum value of 1. Figure 3 shows the results of the numerical evaluation for the gray-level images of Figure 4, the MLPAs of Figure 2, and the FSPCs of Figure 5. As we can see, the NCC values of MLPA and FSPC remain almost invariant to the non-homogeneous illumination variation, whereas the NCC values of the gray-scale images fluctuate with the degree of variation. Homogeneous illumination variation, which can be considered a special case of non-homogeneous illumination variation, is not validated separately in this work. From the visual and numerical validation, we conclude that both MLPA and FSPC remain largely invariant to non-homogeneous illumination variation.
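As an illustration of the kind of numerical check described above, the following sketch computes a normalized cross-correlation between two images given as flat sequences of equal length. It assumes the standard zero-mean form of NCC, which may differ in detail from Eq. 20; the function name `ncc` is hypothetical.

```python
import math

def ncc(f, g):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    n = len(f)
    mean_f, mean_g = sum(f) / n, sum(g) / n
    num = sum((a - mean_f) * (b - mean_g) for a, b in zip(f, g))
    den = math.sqrt(
        sum((a - mean_f) ** 2 for a in f) * sum((b - mean_g) ** 2 for b in g)
    )
    # A constant image has zero variance; return 0 rather than divide by zero.
    return num / den if den > 0.0 else 0.0
```

With this form, a homogeneous change such as $g = 2f + 3$ yields a value of 1, while a gain that varies from pixel to pixel lowers the value, which is the behavior the evaluation relies on.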

### 4.2. Evaluation using synthetic images

Here, $M$ and $N$ denote the height and width of the image, and $v(i, j)$ and $u(i, j)$ denote the intensity value of a pixel without and with noise, respectively. The evaluation is performed as follows: (1) select a set of templates at 10-pixel intervals within the raw image; (2) search for the points corresponding to the template centers in the noisy images using the different methods. The raw image used for the synthetic evaluation, whose content is composed of architecture, roads, vegetation, etc., is an optical satellite image captured under almost ideal imaging conditions. The sizes of the raw image, template and search area are 1600×1200 pixels, 101×101 pixels and 201×201 pixels respectively, and the total number of matches is 26,825. The sizes of the search area and template are kept the same for all methods to ensure a fair comparison. If the Euclidean distance between the matching result and the ground truth is less than 2 pixels, we count the matching result as correct. The experimental images with different degrees of noise are shown in Figure 6. As we can see, the image becomes increasingly degraded as the SNR decreases, and when the SNR decreases to -0.47, the image content can hardly be identified with the unaided eye. The accuracy rates obtained from the different methods for different SNRs are shown in Figure 7. When the SNR is larger than 2, no method is affected, owing to the smoothing effect of the relatively large template. Beyond that point the accuracy rates decrease as the noise level increases, but the accuracy rates of the conventional methods decrease more quickly than that of our method.
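The evaluation protocol above can be sketched as follows. The accuracy criterion (a match is correct when its Euclidean distance to the ground truth is below 2 pixels) is taken from the text. The SNR formula is an assumption: it uses the common definition in decibels, the ratio of signal power to noise power, which may not be the exact definition used here. Both function names are hypothetical.

```python
import math

def snr_db(clean, noisy):
    """Assumed SNR in dB: 10*log10(sum(v^2) / sum((u - v)^2)).

    clean, noisy: pixel intensities without and with noise, flattened.
    Raises ZeroDivisionError if the two images are identical.
    """
    signal = sum(v * v for v in clean)
    noise = sum((u - v) ** 2 for u, v in zip(noisy, clean))
    return 10.0 * math.log10(signal / noise)

def accuracy_rate(matches, truths, tol=2.0):
    """Fraction of matches closer than tol pixels to the ground truth."""
    correct = sum(
        1
        for (mx, my), (tx, ty) in zip(matches, truths)
        if math.hypot(mx - tx, my - ty) < tol
    )
    return correct / len(truths)
```

For example, three matches with errors of 0, 1 and 3 pixels give an accuracy rate of 2/3 under the 2-pixel criterion.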

### 4.3. Matching accuracy evaluation using real images

Where the optimum of a correlation matrix $M$ is its minimum value, we use the transformation $M(i, j) \to M_{\max} - M(i, j)$ to transfer the optimum to the maximum value. For all correlation matrices obtained from the different methods, we use the transformation $M(i, j) \to M(i, j)/M'_{\max}$ to scale the maximum value to 1. The correlation surface of our method clearly has fewer peaks and a more distinct maximum. The conventional methods give a maximum peak that is not very dominant, whereas the maximum of our method stands out from the rest of the surface. In addition, the maximum peak of our method is narrower and therefore provides better localization.
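The two transformations used to make the correlation surfaces comparable can be sketched as below. The function name `normalise_surface` and the flag `optimum_is_min` are hypothetical; the surface is represented as a list of rows.

```python
def normalise_surface(surface, optimum_is_min=False):
    """Flip a surface whose optimum is a minimum, then scale its maximum to 1.

    surface: correlation matrix M as a list of rows of numbers.
    Assumes the (flipped) surface has a positive maximum.
    """
    peak = max(max(row) for row in surface)
    if optimum_is_min:
        # M(i, j) -> M_max - M(i, j): the best match becomes the largest value.
        surface = [[peak - v for v in row] for row in surface]
        peak = max(max(row) for row in surface)
    # M(i, j) -> M(i, j) / M'_max: the largest value becomes 1.
    return [[v / peak for v in row] for row in surface]
```

After this step every method's surface has its best match at value 1, so the height of the secondary peaks and the width of the main peak can be compared directly across methods.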

It should be noted that the proposed method performs better than MI. The underlying assumption of MI is that the statistical relationship between the images to be matched is homogeneous over the whole image domain. This normally holds when the intensity mapping between the images is global and highly correlated, or when structures with different intensities in one image have similar intensities in the other image, e.g. bone and background in CT and MR. However, as discussed above, the statistical relationships between the intensities of multi-modal image pairs are normally neither global nor homogeneous, which is quite different from the situation in medical images. Therefore, MI may not be sufficient for matching multi-modal images. In addition, the absence of local spatial information in MI weakens the matching robustness to some extent.

Symmetries are a potentially robust and stable feature of many man-made and natural scenes, which makes them suitable for representing multi-modal images. LSS, which is designed for scoring local symmetries, therefore works reasonably well in our experiments, with performance almost comparable to that of PC, although its primary goal is to extract local features from images of architectural scenes.

From the evaluation using synthetic and real images, we conclude that, because it accounts for noise resistance, illumination adaptability, and common structure extraction and weighting, the proposed method achieves a higher accuracy rate and better matching confidence than the conventional methods on the test images used.

## 5. Conclusion

To achieve robust multi-modal image matching, we first present two image representations, FSPC and MLPA, based on the Log-Gabor wavelet transformation, and then design the CAS, which combines confidence and similarity using the information of FSPC and MLPA to find the correspondence. The proposed method has three main merits: (1) both MLPA and FSPC are invariant to the non-homogeneous illumination (contrast, brightness) variation and contrast reversal that frequently occur between multi-modal images; (2) FSPC can effectively capture the common structural information of the scene while suppressing the non-common sensor-dependent properties; (3) acting as the confidence factor, the structural information extracted by FSPC is softly given more weight by the CAS. In addition, the proposed method is threshold-free, and can therefore retain as much image detail as possible to resist noise and scene distortions between images. Experiments using numerous real and synthetic images demonstrate that our method can match multi-modal images robustly. Through comparison experiments, we also demonstrate the advantage of our method over the conventional methods. In the future, we plan to introduce geometric transformations into our matching framework and to extend our method to image alignment.

## Declarations

### Acknowledgment

This work was partly supported by Oulu University, Finland. The authors would like to thank Prof. Janne Heikkila, Dr. Jie Chen and Guoying Zhao for their contributions. The authors also want to express their gratitude to the anonymous reviewers whose thoughtful comments and suggestions improved the quality of the article.

## References

1. Conte G, Doherty P: Vision-based unmanned aerial vehicle navigation using Geo-referenced information. *EURASIP J. Adv. Sig. Process.* 2009, 2009: 1-18.
2. Kalal Z, Mikolajczyk K, Matas J: Tracking-learning-detection. *IEEE Trans. Pattern Anal. Mach. Intel.* 2010, 6(1):1-14.
3. Vandewalle P, Susstrunk S, Vetterli M: A frequency domain approach to registration of aliased images with application to super-resolution. *EURASIP J. Adv. Sig. Process.* 2006, 2006: 1-14.
4. Brown M, Lowe D: Unsupervised 3D object recognition and reconstruction in unordered datasets. *Proc. Int. Conf. 3-D Digit. Imag. Model.* 2005, 56-63.
5. Yingzi D, Craig B, Zhi Z: Scale invariant Gabor descriptor-based noncooperative iris recognition. *EURASIP J. Adv. Sig. Process.* 2010, 2010: 1-13.
6. Yang Y, Dong Sun P, Shuying H, Nini R: Medical image fusion via an effective wavelet-based approach. *EURASIP J. Adv. Sig. Process.* 2010, 2010: 1-13.
7. Wong A, Orchard J: Robust multi-modal registration using local phase-coherence representations. *J. Sign. Process. Syst.* 2009, 54: 89-100. 10.1007/s11265-008-0202-x
8. Yingdan WU, Yang MING: A multi-sensor remote sensing image matching method based on SIFT operator and CRA similarity measure. *Proceedings of 2011 International Conference on Intelligence Science and Information Engineering* 2011, 115-118.
9. Sasa W, Zhenbing Z, Ping Y, Zejing G: Infrared and visible image matching algorithm based on NSCT and DAISY. *Proceedings of 2011 4th International Congress on Image and Signal Processing* 2011, 4: 2072-2075.
10. Yong S, Jae H, Jong B: Multi-sensor image registration based on intensity and edge orientation information. *Pattern Recognition* 2008, 41: 3356-3365. 10.1016/j.patcog.2008.04.017
11. Inglada J: Similarity measures for multi-sensor remote sensing images. *Proceedings of Geoscience and Remote Sensing Symposium, Toulouse* 2001, 5236: 182-189.
12. Kovesi P: Image correlation from local frequency information. *Proceedings of the Australian Pattern Recognition Society Conference* 1995, 1995: 336-341.
13. Irani PAM: Robust multi-sensor image alignment. *Proceedings of the 6th International Conference on Computer Vision* 1998, 959-966.
14. Josien PW, Pluim JB, Antoine M, Viergever MA: Mutual-information-based registration of medical images: a survey. *IEEE Trans. Med. Imag.* 2003, 22(8):986-1004. 10.1109/TMI.2003.815867
15. Estévez PA, Tesmer M, Perez CA, Zurada JM: Normalized mutual information feature selection. *IEEE Trans. Neural Netw.* 2009, 20(2):189-201.
16. Liu J, Vemuri BC, Bova F: Efficient multi-modal image registration using local-frequency maps. *Mach. Vis. Appl.* 2002, 13: 149-163. 10.1007/s001380100072
17. Morlet J, Arens G, Fourgeau E, Giard D: Wave propagation and sampling theory-part II: sampling theory and complex waves. *Geophysics* 1982, 47(2):222-236. 10.1190/1.1441329
18. Morrone MC, Ross JR, Burr DC, Owens RA: Mach bands are phase dependent. *Nature* 1986, 324(6094):250-253. 10.1038/324250a0
19. Morrone MC, Owens RA: Feature detection from local energy. *Pattern. Recognit. Lett.* 1987, 6: 303-313. 10.1016/0167-8655(87)90013-4
20. Kovesi P: Phase congruency: a low-level image invariant. *Psychol. Res.* 2000, 64: 136-148. 10.1007/s004260000024
21. Elbakary MI, Sundareshan MK: Multi-modal image registration using local frequency representation and computer-aided design (CAD) models. *Image Vis. Comput.* 2007, 25: 663-670. 10.1016/j.imavis.2006.05.009
22. Zheng L, Robert L: Phase congruence measurement for image similarity assessment. *Pattern. Recognit. Lett.* 2007, 28: 166-172. 10.1016/j.patrec.2006.06.019
23. Daniel Cabrini H, Noah S: Image matching using local symmetry features. *Proc. CVPR* 2012, 206-213.
24. Viola P, Wells WM: Alignment by maximization of mutual information. *Proc. ICCV* 1995, 16-23.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.