
Multi-modal image matching based on local frequency information


This paper addresses the problem of matching multi-modal images that share similar physical structures but differ in appearance. To emphasize the common structural information while suppressing the illumination- and sensor-dependent information between multi-modal images, two image representations, namely Mean Local Phase Angle (MLPA) and Frequency Spread Phase Congruency (FSPC), are proposed using local frequency information in the Log-Gabor wavelet transform space. A confidence-aided similarity (CAS) measure, which consists of a confidence component and a similarity component, is designed to establish the correspondence between multi-modal images. Both representations are invariant to contrast reversal and non-homogeneous illumination variation, and neither requires a derivative or thresholding operation. Because the CAS integrates MLPA with FSPC tightly instead of treating them separately, it gives more weight to the common structures emphasized by FSPC and therefore further reduces the influence of differing sensor properties. We demonstrate the accuracy and robustness of our method by comparing it with popular multi-modal image matching methods. Experimental results show that our method improves on traditional multi-modal image matching and works robustly even in quite challenging situations (e.g. SAR and optical images).

1. Introduction

Image matching, which aims to find corresponding features or image patches between two images of the same scene, is a fundamental problem in computer vision. It has been widely used in vision navigation [1], target recognition and tracking [2], super-resolution [3], 3-D reconstruction [4], pattern recognition [5], medical image processing [6], etc. In this paper, we focus on matching multi-modal (or multi-sensor) images, that is, images acquired by different types of visual sensor. Several issues make multi-modal image matching a very challenging problem [7]. First, multi-modal images are captured using different visual sensors (e.g. SAR, optical, infrared, etc.) at different times. Second, images of different modalities normally map the same content to different intensity values, which makes it difficult to measure similarity from intensity alone. The problem is further complicated by the fact that various intrinsic and extrinsic sensing conditions may lead to image non-homogeneity. Finally, the disparity between the intensity values of multi-modal images can lead to coincidental local intensity matches between non-corresponding content, which makes it difficult for an algorithm to find the correct solution. Hence, the key points of multi-modal image matching are illumination-invariant (contrast and brightness) representations, extraction of common structure under varying conditions, and a robust similarity measure.

Existing approaches to multi-modal image matching can be broadly classified as feature-based or region-based. Feature-based matching uses extracted features to establish correspondence. Interest points [8, 9], edges [10], etc. are often used as local features because they can be extracted and matched robustly. In [8], the Scale Invariant Feature Transform (SIFT) and the cluster reward algorithm (CRA) [11] are used to match multi-modal remote sensing images: the SIFT operator is first used to extract feature points and perform a coarse match, and the CRA similarity measure is then used to achieve accurate correspondence. In [10], Yong et al. propose an algorithm for multi-source image matching based on information entropy, which jointly considers the intensity information and the edge direction information. Feature-based methods must satisfy two requirements: (i) features are extracted robustly and (ii) feature correspondences are established reliably. Failure to meet either of them will cause this type of method to fail.

In contrast to feature-based methods, region-based methods make use of the whole image content to establish correspondence. While most approaches use features for image matching, there is also a significant amount of work on region-based matching. In [12], a local phase-coherence representation is constructed for multi-modal image matching. This representation has several merits that make it a promising candidate for handling non-homogeneous image contrast: (i) it is relatively insensitive to the level of signal energy; (ii) it depends on the structures in the image and can emphasize edges and ridges at the same time; and (iii) it is well localized in the spatial domain. In [13], M. Irani et al. present an energy-image representation based on directional-derivative filters. A set of filters, oriented in the horizontal, vertical, and the two diagonal directions, is applied to the raw image, and each derivative image is then squared to obtain an “energy” image. Thus, the directional information is preserved in this energy representation. This approach, however, requires explicit directional filters and explicit filtering with Gaussian functions to create a pyramid. In addition, mutual information, which has been widely used and has shown great promise in medical image processing, is often adopted as the similarity measure for multi-modal image matching, since it is insensitive to intensity variation and does not require knowledge of the relationship (joint intensity distribution) between the two modalities [14, 15]. The main merit of region-based methods is their resistance to noise and texture distortions: a relatively large template supplies abundant information and thus yields high matching accuracy.

In this paper, we propose a matching framework for multi-modal images based on local frequency information. It exploits the merits of both MLPA and FSPC through the CAS, and can be used to match images captured by similar as well as different types of sensors at different times.

2. Image representations via local frequency information

The human visual system can reliably recognize the same object or scene under widely varying conditions. Even if the illumination of a scene changes by several orders of magnitude, our interpretation of it remains largely unchanged. Thus, the main form of invariance required in image matching is invariance to illumination; this is particularly important for multi-modal images, where non-homogeneous contrast and brightness variations frequently occur. In this work, local frequency information is used to construct two image representations for multi-modal image matching, FSPC and MLPA, which are both dimensionless and invariant to non-homogeneous illumination variation and contrast reversal.

2.1. Log-Gabor function

To preserve phase information, linear-phase filters that are nonorthogonal and come in symmetric/anti-symmetric quadrature pairs should be used. In [16], J. Liu et al. use Gabor filters, which can be tuned to any desired frequency or orientation and offer simultaneous localization in space and frequency, to construct a local-frequency representation for multi-modal images. However, a Gabor function cannot maintain a zero DC component for bandwidths over one octave. Log-Gabor filters have all the merits of Gabor filters and additionally allow filters of arbitrarily large bandwidth to be constructed while still maintaining a zero DC component in the even-symmetric filter. Hence, in this work we use Log-Gabor filters, which have a Gaussian transfer function when viewed on a logarithmic frequency scale, instead of Gabor filters as the basis of our local frequency representations [17].

Because of the singularity of the log function at the origin, the 2D Log-Gabor filter must be constructed in the frequency domain. In polar coordinates, the Log-Gabor function can be separated into two components: a radial component and an angular component. The radial component has a frequency response described by

$$G_r(r) = \exp\left(-\frac{\left(\log(r/f_0)\right)^2}{2\sigma_r^2}\right)$$

And the angular component has a frequency response described by

$$G_\theta(\theta) = \exp\left(-\frac{(\theta-\theta_0)^2}{2\sigma_\theta^2}\right)$$

The two components are multiplied together to construct the overall Log-Gabor filter which has the transfer function as

$$G(r,\theta) = G_r(r)\,G_\theta(\theta)$$

where $(r, \theta)$ are the polar coordinates. As the definitions show, the Log-Gabor filter is determined by four parameters: $f_0$, $\theta_0$, $\sigma_r$ and $\sigma_\theta$, where $f_0$ and $\theta_0$ are the center frequency and orientation angle, and $\sigma_r$ and $\sigma_\theta$ determine the scale bandwidth and the angular bandwidth, respectively. In the filter bank, the transfer function of each filter must overlap sufficiently with those of its neighbors so that the sum of all transfer functions covers the spectrum relatively uniformly.
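As a rough illustration of the radial, angular and combined transfer functions, the following NumPy sketch builds a single Log-Gabor filter directly on the FFT frequency grid. The function name `log_gabor`, the use of normalized frequencies (cycles per pixel), the wrapping of the angular distance to [-π, π] and the parameter values in the last line are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def log_gabor(shape, f0, theta0, sigma_r, sigma_theta):
    """Frequency-domain Log-Gabor transfer function G(r, theta) = G_r(r) * G_theta(theta)."""
    rows, cols = shape
    # Normalized frequency grid in FFT layout (zero frequency at index [0, 0]).
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    r = np.hypot(fx, fy)
    theta = np.arctan2(fy, fx)
    r[0, 0] = 1.0  # placeholder so that log(0) is never evaluated

    # Radial component: Gaussian on a logarithmic frequency axis.
    radial = np.exp(-(np.log(r / f0) ** 2) / (2.0 * sigma_r ** 2))
    radial[0, 0] = 0.0  # enforce a zero DC component

    # Angular component: Gaussian in the wrapped angular distance to theta0.
    dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
    angular = np.exp(-(dtheta ** 2) / (2.0 * sigma_theta ** 2))
    return radial * angular

# Illustrative parameter values only.
G = log_gabor((64, 64), f0=0.1, theta0=0.0, sigma_r=0.55, sigma_theta=0.4)
```

Because the filter is defined on the frequency grid, the singularity of the logarithm at the origin only has to be handled at the single DC sample.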

2.2. Local frequency representations

In the search for invariant quantities in multi-modal images, the proposed approach exploits information from the frequency domain rather than the spatial domain. Let $I$ denote the signal, and let $LG^{e}_{n,\theta}$ and $LG^{o}_{n,\theta}$ denote the even-symmetric and odd-symmetric components of the Log-Gabor function at scale $n$ and orientation $\theta$. The response vector formed by the responses of each quadrature pair of filters can be expressed as

$$\left[e_{n,\theta}(x),\; o_{n,\theta}(x)\right] = \left[I(x) * LG^{e}_{n,\theta},\; I(x) * LG^{o}_{n,\theta}\right]$$

The values $e_{n,\theta}(x)$ and $o_{n,\theta}(x)$ can be regarded as the real and imaginary parts of a complex-valued frequency component. The amplitude of the response vector at scale $n$ and orientation $\theta$ is given by

$$A_{n,\theta}(x) = \sqrt{e_{n,\theta}(x)^2 + o_{n,\theta}(x)^2}$$

and the phase is given by

$$\phi_{n,\theta}(x) = \operatorname{atan2}\left(o_{n,\theta}(x),\; e_{n,\theta}(x)\right)$$
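The quadrature responses, amplitude and phase can be sketched for a 1-D signal as follows. The sketch assumes that the even and odd responses are obtained together, as the real and imaginary parts of the inverse FFT of the spectrum multiplied by a one-sided (positive-frequency) radial Log-Gabor transfer function. This is a common way to realise such a filter pair and is not necessarily the authors' implementation; the function name and the parameter values are illustrative.

```python
import numpy as np

def quadrature_response(signal, f0, sigma_r):
    """Return even response e, odd response o, amplitude A and phase phi."""
    signal = np.asarray(signal, dtype=float)
    freqs = np.fft.fftfreq(signal.size)  # cycles per sample
    G = np.zeros(signal.size)
    pos = freqs > 0  # one-sided filter, so the DC gain is zero
    G[pos] = np.exp(-(np.log(freqs[pos] / f0) ** 2) / (2.0 * sigma_r ** 2))

    # Real part: convolution with the even-symmetric filter.
    # Imaginary part: convolution with the odd-symmetric filter.
    response = np.fft.ifft(np.fft.fft(signal) * G)
    e, o = response.real, response.imag
    A = np.hypot(e, o)      # amplitude of the response vector
    phi = np.arctan2(o, e)  # local phase
    return e, o, A, phi

# Example: a step edge filtered at one (illustrative) scale.
t = np.arange(128)
step = (t >= 64).astype(float)
e, o, A, phi = quadrature_response(step, f0=1 / 16, sigma_r=0.55)
```

Scaling the input signal scales `A` by the same factor but leaves `phi` unchanged, which is the property the phase-based representation relies on.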

At each location x of a signal, we will have an array of these response vectors (each vector corresponds to one scale and orientation of filter). The response vectors form the basis of the proposed representations. The MLPA can be calculated as follows:

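$$\mathrm{MLPA}(x) = \begin{cases} \dfrac{\operatorname{atan2}\left(F(x), H(x)\right)}{\pi} \times 255, & \text{if } \operatorname{atan2}\left(F(x), H(x)\right) \ge 0; \\[2ex] \dfrac{\pi + \operatorname{atan2}\left(F(x), H(x)\right)}{\pi} \times 255, & \text{if } \operatorname{atan2}\left(F(x), H(x)\right) < 0. \end{cases}$$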

where F(x) and H(x) can be calculated by summing the even and odd filter convolutions:

$$F(x) = \sum_{\theta} \sum_{n} e_{n,\theta}(x)$$
$$H(x) = \sum_{\theta} \sum_{n} o_{n,\theta}(x)$$

Contrast reversal, which may occur between multi-modal images (e.g. Figure 1b), is eliminated by mapping the orientation of the mean local frequency vector [F(x), H(x)] from the third/fourth quadrant (where atan2(F(x), H(x)) < 0) to the first/second quadrant (where atan2(F(x), H(x)) ≥ 0). Each MLPA value is a measure of the mean local phase angle and is independent of the overall energy of the signal. Hence, all MLPA maps have the same units and are invariant to both scale and offset changes in illumination (e.g. Figures 2 and 3). The main goal of MLPA is to eliminate the variation of intensity values between corresponding pixels of a multi-modal image pair by using the phase information of local frequency. A sophisticated matching algorithm normally needs an outlier rejection mechanism, since in many situations there are more “outliers” (non-common scene) than “inliers” (common scene) between multi-modal images. MLPA alone, however, cannot identify the inliers or eliminate the influence of the outliers. Hence, in this work FSPC, which aims to capture the common scene information while suppressing the illumination- and sensor-dependent information, is developed from the amplitude information of local frequency.
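A minimal sketch of the MLPA computation is given below. It assumes that the even and odd responses are stacked along the first axis of two arrays, and that the vector [F(x), H(x)] is read as (horizontal, vertical), so that the angle is obtained with `np.arctan2(H, F)`; this reading of the argument order is an assumption of the example, as is the function name. The folding of negative angles makes the result independent of that choice as far as contrast reversal is concerned.

```python
import numpy as np

def mlpa(even, odd):
    """Mean local phase angle, scaled to [0, 255].

    even, odd: arrays of shape (n_filters, ...) holding e_{n,theta}(x) and o_{n,theta}(x).
    """
    F = even.sum(axis=0)  # sum of even responses over scales and orientations
    H = odd.sum(axis=0)   # sum of odd responses over scales and orientations
    angle = np.arctan2(H, F)  # assumed argument order, see the note above
    # Fold the third/fourth quadrant (negative angles) onto the first/second.
    angle = np.where(angle < 0, angle + np.pi, angle)
    return angle / np.pi * 255.0

# Contrast reversal negates every response; the MLPA map is unchanged.
rng = np.random.default_rng(0)
even = rng.normal(size=(6, 50))
odd = rng.normal(size=(6, 50))
m_original = mlpa(even, odd)
m_reversed = mlpa(-even, -odd)
```

Negating all responses rotates the mean vector by π, and the folding step maps the rotated angle back onto the original one.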

Figure 1

Matching results for different multi-modal image pairs. (a) Optical and SAR images; (b) infrared and optical images.

Figure 2

MLPAs corresponding to the images of Figure 4.

Figure 3

The illumination-invariant property of the proposed image representations.

For multi-modal images, the signals are correlated primarily in their high-frequency information, and the correlation tends to degrade as high-frequency information is reduced [13]. This is because high-frequency information (e.g. edges, contours, corners, junctions, etc.) normally corresponds to physical structure that is common to images of different modalities. Low-frequency information, on the other hand, depends heavily on the illumination and on the photometric and physical imaging properties of the sensors, which differ substantially between multi-modal images. To capture the common physical structure, high-pass filters working in the spatial domain (e.g. Sobel, Prewitt, Laplacian, etc.) are commonly adopted [10, 13]. These methods are straightforward and fast to compute. However, they normally depend on intensity gradient information, which is closely tied to local image contrast, so non-homogeneous contrast variation may degrade their performance.

Working in the frequency domain, phase congruency theory postulates that structural information is perceived at points where the local frequency components are maximally in phase, rather than assuming that it lies at points of maximal intensity gradient. The measure of phase congruency at a point x in a signal, proposed by Morrone et al. in [18, 19], can be expressed as

$$PC_1(x) = \frac{E(x)}{\sum_{\theta}\sum_{n} A_{n,\theta}(x)} = \frac{\sqrt{F^2(x) + H^2(x)}}{\sum_{\theta}\sum_{n} A_{n,\theta}(x)}$$

where E(x) denotes the energy, that is, the magnitude of the vector sum of the responses. As the definition shows, phase congruency is the ratio of the energy E(x) to the total path length taken by the local frequency components in reaching the end point. If all the local frequency components are in phase, all the response vectors are aligned and the phase congruency, $PC_1$, takes its maximum value of 1. If there is no phase coherence, $PC_1$ falls to its minimum of 0. Phase congruency is independent of the overall magnitude of the signal, which makes it invariant to variations in image brightness and contrast.
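The two limiting cases described above can be checked numerically with a small sketch. The function name `phase_congruency` and the small constant `eps` (added only to guard against division by zero) are assumptions of this example.

```python
import numpy as np

def phase_congruency(even, odd, eps=1e-12):
    """PC_1 = |vector sum of responses| / sum of response amplitudes."""
    F = even.sum(axis=0)
    H = odd.sum(axis=0)
    energy = np.hypot(F, H)                           # E(x)
    amplitude_sum = np.hypot(even, odd).sum(axis=0)   # sum of A_{n,theta}(x)
    return energy / (amplitude_sum + eps)

# Two filter responses at one location, written as complex numbers A * exp(i * phi).
aligned = np.array([1.0, 2.0]) * np.exp(1j * 0.3)
opposed = np.array([1.0, 1.0]) * np.exp(1j * np.array([0.3, 0.3 + np.pi]))

pc_aligned = phase_congruency(aligned.real, aligned.imag)  # components in phase
pc_opposed = phase_congruency(opposed.real, opposed.imag)  # components cancel
```

With aligned phases the vector sum is as long as the path (ratio close to 1); with opposite phases of equal amplitude the vector sum vanishes (ratio close to 0).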

Clearly, phase congruency is only significant if it occurs over a wide range of frequencies: congruency over a broad spectrum is more significant than congruency over a narrow one. Thus, as a measure of feature significance, phase congruency should be weighted by the frequency spread. To address this limitation of conventional phase congruency [18, 19], we present a novel FSPC that uses a weighting function to devalue phase congruency at locations where the spread of filter responses is narrow. A measure of frequency spread can be defined as

$$s(x) = \frac{1}{\sqrt{N}} \cdot \frac{\sum_{\theta}\sum_{n} A_{n,\theta}(x)}{\sqrt{\sum_{\theta}\sum_{n} A_{n,\theta}^2(x)} + \varepsilon}$$

where N denotes the total number of filters, and ε avoids division by zero and discounts the result when both $\sum_{\theta}\sum_{n} A_{n,\theta}(x)$ and $\sum_{\theta}\sum_{n} A_{n,\theta}^2(x)$ are very small. The value of the spread function, s(x), varies between 0 and 1. If the filter responses are distributed uniformly over the whole spectrum, s(x) reaches its maximum value of 1. The frequency spread weighting function is constructed by applying a hyperbolic tangent function to the filter response spread value,

$$W(x) = \frac{1}{2}\left[1 + \tanh\left(\frac{\lambda}{2}\left(s(x) - c\right)\right)\right]$$

where c is the “cut-off” value of the spread, below which phase congruency is penalized, and λ is a gain factor that controls the sharpness of the cut-off. Thus, the definition of FSPC can be given as

$$\mathrm{FSPC}(x) = W(x)\,\frac{E(x)}{\sum_{\theta}\sum_{n} A_{n,\theta}(x) + \varepsilon} \times 255$$

Weighting by frequency spread reduces ill-conditioned responses that have a low frequency spread and improves the localization accuracy of features, especially smoothed features whose responses are normally uniform [20]. In addition, noise resistance is improved to some extent, since the responses to noise are normally skewed towards the high-frequency end and therefore have a relatively narrow frequency spectrum.
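The effect of the spread weighting can be sketched as follows. The example assumes that the spread measure is normalized as s(x) = (1/√N) · ΣA / (√(ΣA²) + ε), which reaches 1 for uniform responses, and that W(x) is the tanh form of a sigmoid centered on the cut-off. The function name and the values of `c`, `lam` and `eps` are placeholders for illustration, not the values of Table 1.

```python
import numpy as np

def fspc(even, odd, c=0.5, lam=10.0, eps=1e-6):
    """Frequency-spread-weighted phase congruency, scaled to [0, 255].

    even, odd: arrays of shape (n_filters, ...) of quadrature responses.
    c, lam, eps: illustrative cut-off, gain and stabilizing constant.
    """
    A = np.hypot(even, odd)
    n_filters = A.shape[0]
    sum_A = A.sum(axis=0)
    # Assumed spread measure: equals 1 when all filters respond equally.
    s = sum_A / (np.sqrt(n_filters) * (np.sqrt((A ** 2).sum(axis=0)) + eps))
    # Weighting function: penalizes locations whose spread is below the cut-off c.
    W = 0.5 * (1.0 + np.tanh(0.5 * lam * (s - c)))
    energy = np.hypot(even.sum(axis=0), odd.sum(axis=0))
    return W * energy / (sum_A + eps) * 255.0

# Four in-phase filters: responses spread over all filters vs. a single filter.
wide = fspc(np.ones((4, 1)), np.zeros((4, 1)))
narrow = fspc(np.array([[1.0], [0.0], [0.0], [0.0]]), np.zeros((4, 1)))
```

Both locations have perfect phase congruency, but the location where only one filter responds is devalued by the weighting function.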

3. Matching using local frequency representations

Having obtained the local frequency representations, we then use them to perform matching. As their definitions show, MLPA primarily represents the phase information of local frequency, whereas FSPC mainly uses the amplitude information, so the two carry largely independent information and can complement each other. Hence, a suitable fusion scheme that makes the best use of the merits of MLPA and FSPC can achieve better matching performance. For example, using MLPA alone may induce errors, particularly in texture-less image regions, where FSPC normally has quite a small value because significant features are lacking. In addition, it may be difficult to distinguish between two search windows that have similar MLPA but different FSPC.

In this work, we propose a novel confidence-aided similarity (CAS) measure that combines MLPA and FSPC to improve matching robustness. CAS consists of two components: a similarity component and a confidence component. Let $\mathrm{FSPC}_1$, $\mathrm{FSPC}_2$, $\mathrm{MLPA}_1$ and $\mathrm{MLPA}_2$ denote the pair of FSPC values and the pair of MLPA values to be compared. The CAS for a single pair of values can then be expressed as

$$\mathrm{CAS}_0 = d + c$$

where $d = -\left|\mathrm{MLPA}_1 - \mathrm{MLPA}_2\right|$ and $c = \frac{1}{2}\left(\mathrm{FSPC}_1 + \mathrm{FSPC}_2\right)$. Here d is the similarity component, which reflects how well the two signals resemble each other, and c is the confidence component, which reflects the confidence that a match is correct.

An MLPA value with low FSPC is normally less reliable than one with high FSPC, so it is important to give more confidence to higher FSPC. The confidence component is the mean of the two FSPC values, so the confidence is closely related to the significance of the signals and is large when both signals are significant. Hence, $\mathrm{CAS}_0$ is relatively large when the two pixels are both similar and significant, and relatively small when they are not.

The CAS between two windows of size (2n+1)×(2m+1) centered at (x, y) and (u, v) is given by

$$\mathrm{CAS}_1(x, y; u, v) = C/2 - D$$


$$C = \sum_{i=-n}^{n} \sum_{j=-m}^{m} \left[\mathrm{FSPC}_1(x+i, y+j) + \mathrm{FSPC}_2(u+i, v+j)\right]$$
$$D = \sum_{i=-n}^{n} \sum_{j=-m}^{m} \left|\mathrm{MLPA}_1(x+i, y+j) - \mathrm{MLPA}_2(u+i, v+j)\right|$$

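$\mathrm{CAS}_1$ can be normalized so that its maximum value is equal to 1:

$$\mathrm{CAS}_2(x, y; u, v) = \frac{C/2 - D}{C/2} = 1 - \frac{2D}{C}$$

Since maximizing $\mathrm{CAS}_2$ is equivalent to minimizing $D/C$, the measure can be further simplified as

$$\mathrm{CAS}(x, y; u, v) = \frac{D}{C}$$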

This measure returns 0 when the matching windows are identical, so smaller values indicate better matches. The denominator, C, is related to the confidence component. For the same value of D, the CAS is smaller, i.e. the match is judged more similar, when the associated confidence components are higher. It is apparent that CAS is invariant to global linear illumination transformations: $I \to \alpha I + b$.
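The simplified measure and an exhaustive template search that minimizes it can be sketched as below. The function names, the constant `eps` guarding against a zero denominator, and the brute-force double loop are assumptions of this example; the inputs are taken to be precomputed MLPA and FSPC maps of the template and of the search area.

```python
import numpy as np

def cas(mlpa1, fspc1, mlpa2, fspc2, eps=1e-12):
    """Simplified confidence-aided similarity D / C of two equally sized windows."""
    C = np.sum(fspc1 + fspc2)           # confidence term
    D = np.sum(np.abs(mlpa1 - mlpa2))   # dissimilarity term
    return D / (C + eps)

def match_template(mlpa_t, fspc_t, mlpa_s, fspc_s):
    """Slide the template over the search area and return the CAS minimum."""
    th, tw = mlpa_t.shape
    sh, sw = mlpa_s.shape
    surface = np.empty((sh - th + 1, sw - tw + 1))
    for r in range(surface.shape[0]):
        for c in range(surface.shape[1]):
            surface[r, c] = cas(mlpa_t, fspc_t,
                                mlpa_s[r:r + th, c:c + tw],
                                fspc_s[r:r + th, c:c + tw])
    best = np.unravel_index(np.argmin(surface), surface.shape)
    return (int(best[0]), int(best[1])), surface

# Synthetic check: a template cut out of the search maps is found at its origin.
rng = np.random.default_rng(1)
mlpa_s = rng.uniform(0, 255, (20, 20))
fspc_s = rng.uniform(1, 255, (20, 20))
position, surface = match_template(mlpa_s[5:12, 7:14], fspc_s[5:12, 7:14],
                                   mlpa_s, fspc_s)
```

The returned position is the top-left corner of the best window; the surface holds the CAS value for every candidate offset.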

4. Implementation and experiments

The main steps of the proposed approach are as follows: (1) calculate the local frequency information by applying the Log-Gabor wavelet transform to the raw multi-modal images; (2) construct the local frequency representations, MLPA and FSPC, from the local frequency information; (3) search for the correspondence by minimizing the CAS between the template and the search window. The values of the primary parameters used in the experiments are given in Table 1. These values were chosen heuristically, and there is no need to change them to adapt to different multi-modal scenes during image matching. In the experiments, we observed that a good value for every parameter can be chosen from a relatively wide range. For a given wavelength of the smallest-scale filter, scaling factor between successive filters, cut-off value and gain factor, more wavelet scales and orientations improve experimental performance but inevitably increase the computation time. Hence, we use 4 wavelet scales and 9 orientations as a compromise between performance and efficiency.

To evaluate the performance of our method, we conduct numerous experiments using both synthetic and real images, and compare the results with state-of-the-art methods, including the four-directional derivative-energy image (FDDEI) [13], local frequency representation (LFR) [21], phase congruency (PC) [22], local symmetry score (LSS) [23], and mutual information (MI) [24]. In the experiments, the joint histograms used to calculate MI are generated with 32×32 bins, as suggested in [11]. The local symmetry score is included because of its illumination-invariant property and its robustness to a range of dramatic variations. For LSS, we adopt the product of the horizontal and vertical symmetry scores, which are based on a histogram of local gradient orientations and are more stable under photometric changes, as the image representation, and the zero-mean normalized cross-correlation (ZNCC) as the similarity measure.
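Step (1) of the procedure, with the 4 scales and 9 orientations stated above, can be sketched as a loop over a Log-Gabor filter bank. The function name and the values of `min_wavelength`, `mult`, `sigma_r` and `sigma_theta` are hypothetical placeholders chosen for illustration, since the values of Table 1 are not reproduced here. The even and odd responses are again taken as the real and imaginary parts of the filtered image, which is an assumption about the implementation.

```python
import numpy as np

def log_gabor_bank_responses(image, n_scales=4, n_orient=9,
                             min_wavelength=3.0, mult=2.1,
                             sigma_r=0.55, sigma_theta=0.4):
    """Return stacked even and odd responses, each of shape (n_scales * n_orient, H, W)."""
    rows, cols = image.shape
    fy = np.fft.fftfreq(rows)[:, None]
    fx = np.fft.fftfreq(cols)[None, :]
    r = np.hypot(fx, fy)
    r[0, 0] = 1.0  # avoid log(0); the DC gain is set to zero below
    theta = np.arctan2(fy, fx)
    spectrum = np.fft.fft2(image)

    even, odd = [], []
    for k in range(n_orient):
        theta0 = k * np.pi / n_orient  # orientations spread over half a turn
        dtheta = np.arctan2(np.sin(theta - theta0), np.cos(theta - theta0))
        angular = np.exp(-(dtheta ** 2) / (2.0 * sigma_theta ** 2))
        for s in range(n_scales):
            f0 = 1.0 / (min_wavelength * mult ** s)  # center frequency of scale s
            radial = np.exp(-(np.log(r / f0) ** 2) / (2.0 * sigma_r ** 2))
            radial[0, 0] = 0.0
            response = np.fft.ifft2(spectrum * radial * angular)
            even.append(response.real)
            odd.append(response.imag)
    return np.stack(even), np.stack(odd)

rng = np.random.default_rng(0)
image = rng.random((32, 32))
even, odd = log_gabor_bank_responses(image)
```

The two stacks can then be summed over their first axis to form F(x) and H(x), or combined into amplitudes, as required by the MLPA and FSPC definitions.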

Table 1 The values for the method parameters used in the experiments
Table 2 Comparisons of accuracy rates obtained from different methods

4.1. Illumination invariant property

Visual and numerical experiments are first conducted to evaluate the illumination-invariant property of the proposed MLPA and FSPC. Non-homogeneous illumination variation is synthesized by dividing an image into four equal parts, multiplying each part by a random scale factor to simulate contrast variation, and then adding a random constant to simulate brightness variation. Figure 4 shows a set of synthetic images with non-homogeneous illumination variation; the contrast and brightness clearly vary between the parts of each image and between images. Figures 2 and 5 show the MLPA and FSPC images corresponding to the images of Figure 4. The illumination variation is almost invisible to the unaided eye in these representations. At the boundaries between illumination regions, some straight-line edges can be observed, because the distribution of intensity values in the neighborhood of a boundary is similar to that in the neighborhood of a step edge.
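The synthesis of non-homogeneous illumination described above can be sketched as follows. The function name and the ranges from which the random scale factors and offsets are drawn are assumptions made for this example; the text does not state them.

```python
import numpy as np

def nonhomogeneous_illumination(image, rng,
                                scale_range=(0.5, 1.5),
                                offset_range=(-30.0, 30.0)):
    """Apply an independent random gain and offset to each image quadrant."""
    out = np.asarray(image, dtype=float).copy()
    half_r, half_c = out.shape[0] // 2, out.shape[1] // 2
    for row_slice in (slice(0, half_r), slice(half_r, None)):
        for col_slice in (slice(0, half_c), slice(half_c, None)):
            gain = rng.uniform(*scale_range)     # simulates contrast variation
            offset = rng.uniform(*offset_range)  # simulates brightness variation
            out[row_slice, col_slice] = out[row_slice, col_slice] * gain + offset
    return out

rng = np.random.default_rng(0)
flat = np.full((8, 8), 100.0)
varied = nonhomogeneous_illumination(flat, rng)
```

Applied to a uniform test image, the result contains four constant quadrants with different levels, separated by the step-like boundaries mentioned in the text.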

Figure 4

Gray-level images with non-homogeneous illumination variation. The 1st image is the raw infrared image, and the rest are the synthetic images.

Figure 5

FSPCs corresponding to the images of Figure 4.

To perform the numerical evaluation, we employ the normalized cross-correlation (NCC) to measure the similarity between the raw image and the synthetic image with non-homogeneous illumination variation. The definition of NCC can be expressed as

$$\mathrm{NCC} = \frac{\sum_{i}\sum_{j} f(i,j)\, g(i,j)}{\sqrt{\sum_{i}\sum_{j} f^2(i,j) \sum_{i}\sum_{j} g^2(i,j)}}$$

where f and g denote the raw and synthetic images, respectively. From this definition, we can see that the value of NCC depends strongly on the degree of non-homogeneous illumination variation. If there is no non-homogeneous illumination variation, NCC takes its maximum value of 1. Figure 3 shows the results of the numerical evaluation for the gray-level images of Figure 4, the MLPAs of Figure 2, and the FSPCs of Figure 5. The NCC values of MLPA and FSPC remain almost invariant to the non-homogeneous illumination variation, whereas the NCC values of the gray-level images fluctuate with the degree of variation. Homogeneous illumination variation, which can be regarded as a special case of non-homogeneous illumination variation, is not validated separately in this work. From the visual and numerical validation, we conclude that both MLPA and FSPC are largely invariant to non-homogeneous illumination variation.
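The NCC used for the numerical evaluation can be written in a few lines. The function name is an assumption; the formula follows the definition above, which does not subtract the mean.

```python
import numpy as np

def ncc(f, g):
    """Normalized cross-correlation of two equally sized images (no mean removal)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.sum(f * g) / np.sqrt(np.sum(f ** 2) * np.sum(g ** 2))

rng = np.random.default_rng(0)
raw = rng.uniform(1, 255, (16, 16))
same = ncc(raw, raw)             # identical images
scaled = ncc(raw, 2.5 * raw)     # global contrast change
shifted = ncc(raw, raw + 50.0)   # global brightness change
```

In this form the measure is unaffected by a global scale factor but drops below 1 when an offset is added, so it responds to the brightness component of the illumination variation as well as to spatially varying gains.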

4.2. Evaluation using synthetic images

We evaluate the matching accuracy and noise resistance using synthetic images generated by adding Gaussian white noise to the raw images with the imnoise function of Matlab 2010b. The noise mean is fixed at 0, and the variance is increased gradually from 0.1 to 3.5. We use the signal-to-noise ratio (SNR) to describe the degree of noise. The SNR is defined as

$$\mathrm{SNR} = 10 \times \log_{10} \frac{\sum_{i=1}^{M}\sum_{j=1}^{N} v(i,j)^2}{\sum_{i=1}^{M}\sum_{j=1}^{N} \left(u(i,j) - v(i,j)\right)^2}$$

where M and N denote the height and width of the image, and v(i, j) and u(i, j) denote the intensity of a pixel without and with noise, respectively. The evaluation is performed as follows: (1) select a set of templates at 10-pixel intervals within the raw image; (2) search for the points corresponding to the template centers in the noisy images using the different methods. The raw image used for the synthetic evaluation, whose content includes architecture, roads, vegetation, etc., is an optical satellite image acquired under almost ideal imaging conditions. The sizes of the raw image, template and search area are 1600×1200, 101×101 and 201×201 pixels, respectively, and the total number of matches is 26,825. The sizes of the search area and template are the same for all methods to ensure a fair comparison. If the Euclidean distance between the matching result and the ground truth is less than 2 pixels, the matching result is counted as correct. The experimental images with different degrees of noise are shown in Figure 6. The image becomes increasingly degraded as the SNR decreases, and when the SNR falls to -0.47 the image content can hardly be identified with the unaided eye. The accuracy rates obtained by the different methods for different SNRs are shown in Figure 7. When the SNR is larger than 2, none of the methods is affected, owing to the smoothing effect of the relatively large template. As the noise increases further, the accuracy rates begin to decrease, but those of the conventional methods decrease more quickly than that of our method.
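The SNR definition and the 2-pixel correctness criterion can be sketched as below. The function names are assumptions made for this example, and match positions are taken to be (row, column) pairs in pixels.

```python
import numpy as np

def snr_db(clean, noisy):
    """SNR in decibels: signal power over the power of the added noise."""
    clean = np.asarray(clean, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum((noisy - clean) ** 2))

def is_correct(match, truth, tol=2.0):
    """A match is correct if its Euclidean distance to the ground truth is below tol pixels."""
    return np.hypot(match[0] - truth[0], match[1] - truth[1]) < tol

def accuracy_rate(matches, truths, tol=2.0):
    """Fraction of matches that satisfy the distance criterion."""
    return float(np.mean([is_correct(m, t, tol) for m, t in zip(matches, truths)]))

clean = np.full((4, 4), 10.0)
example_snr = snr_db(clean, clean + 1.0)  # signal power 100 times the noise power
```

The accuracy rates reported in the experiments correspond to `accuracy_rate` evaluated over all template positions for a given noise level.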

Figure 6

Noisy images with the SNR of 5.1728, 2.0026, 1.0864 and -0.47 respectively.

Figure 7

Accuracy rates for different SNRs.

4.3. Matching accuracy evaluation using real images

To evaluate matching accuracy comprehensively and objectively, we perform numerous experiments using real multi-modal image pairs. The database used for image matching includes 254 pairs of infrared and optical images and 52 pairs of SAR and optical images, with a wide range of illumination and significant appearance changes caused by the photometric and physical imaging properties of the different sensors. Figures 8 and 9 show two sets of matching results obtained by the different methods, and Table 2 gives the accuracy rates corresponding to the image pairs of Figures 8 and 9. As Figures 8, 9 and 10 show, non-homogeneous contrast and brightness variation occurs frequently between multi-modal image pairs, but the structural information remains common and reliable, for example the edges of the airport runway in Figure 8 and the contours of cars, architecture, lampposts, people, etc. in Figures 1 and 9. For the optical and infrared image pair of Figure 9, the conventional methods work well because of the relatively significant features and contrast variation, but their performance degrades dramatically on the optical and SAR image pair of Figure 8, which has harsh speckle noise and contrast variation. The proposed method works reasonably well in both situations.

Figure 8

Matching results for the optical and SAR image pair. From left: optical image (the template center is marked with a cross); our method; PC; LFR; FDDEI; LSS; MI.

Figure 9

Matching results for the optical and infrared image pair. From left: optical image; our method; PC; LFR; FDDEI; LSS; MI.

Figure 10

Matching results of the proposed method. (a) Aerial images from the intensified charge-coupled device (ICCD); (b) reference image.

We have applied the proposed method to scene matching for Unmanned Aerial Vehicle (UAV) positioning, and conducted a total of 10,080 scene matching experiments using aerial images obtained by electro-optic pods, including 6,180 infrared images and 3,900 ICCD images. The ground scenes consist of architecture, rivers, vegetation, farmland, highland, etc. The images were acquired during both day and night, at altitudes ranging from 150 meters to 2000 meters. The reference images, obtained from a space-borne optical sensor with a spatial resolution of 4 meters, are updated relatively slowly, so the aerial images and reference images are normally several years apart and show dramatic differences in the ground scene (e.g. appearance/disappearance of buildings, growth/withering of vegetation, drought/waterlogging of rivers, etc.), as well as changes caused by the different sensors. Figure 10 shows a set of scene matching results. The geometric distortion caused by imaging attitude and altitude is removed using information from the INS and altimeter. The ground truth for scene matching is provided by GPS, which generally has a positioning accuracy better than 1 meter. If the difference between the scene matching result and the GPS position is less than 8 meters (2 pixels), the result is counted as correct; otherwise it is counted as a failure. Under this criterion, the accuracy rate of our scene matching is 96.63%, which is well within engineering requirements, whereas the accuracy rates of PC, LFR, FDDEI, LSS and MI are 85.78%, 82.69%, 76.73%, 83.24% and 74.89%, respectively. Figure 11 shows a set of flight trajectories measured by the different methods. The results of our method agree more closely with GPS. The very few false matches in our results can be effectively eliminated by a filtering operation (e.g. a Kalman filter).

Figure 11

Comparisons of flight trajectories obtained from different methods. (a) GPS; (b) our method; (c) PC; (d) LFR; (e) FDDEI; (f) LSS; (g) MI.

The shape of the correlation surface is related to the confidence of the matching result. We examine numerous correlation surfaces computed from multi-modal image pairs randomly chosen from the image matching and scene matching databases. Figure 1a,b shows two matching results, for the optical and SAR images and for the infrared and optical images, respectively, and Figures 12 and 13 show the correlation surfaces obtained by the different methods for the matching results of Figure 1a,b, respectively. For a correlation matrix whose optimum corresponds to the minimum value, we reverse the correlation surface using the transformation $M(i, j) \to M_{\max} - M(i, j)$, so that the optimum becomes the maximum value. For all correlation matrices obtained by the different methods, we use the transformation $M(i, j) \to M(i, j)/M_{\max}$ to scale the maximum value to 1. The surface of our method clearly has fewer peaks and a more distinct maximum. For the conventional methods the maximum peak is not very dominant, whereas for our method the maximum stands out from the rest of the surface. In addition, the maximum peak of our method is narrower and therefore provides better localization.
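The two transformations applied to the correlation matrices can be sketched as follows. The function name and the flag `optimum_is_min` are assumptions of this example, and the sketch presumes that the surface is not constant, so that its maximum after reversal is non-zero.

```python
import numpy as np

def normalise_surface(M, optimum_is_min):
    """Make the optimum the maximum of the surface and scale that maximum to 1."""
    M = np.asarray(M, dtype=float)
    if optimum_is_min:
        M = M.max() - M   # M(i, j) -> M_max - M(i, j)
    return M / M.max()    # M(i, j) -> M(i, j) / M_max

# A distance-like surface (smaller is better), such as one produced by CAS.
distance_surface = [[0.0, 2.0],
                    [4.0, 1.0]]
normalised = normalise_surface(distance_surface, optimum_is_min=True)
```

After this step, surfaces from minimizing and maximizing measures share the same orientation and range, so their peak shapes can be compared directly.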

Figure 12

Correlation surfaces for the images of Figure 1a. From left: our method; PC; LFR; FDDEI; LSS; MI.

Figure 13

Correlation surfaces for the images of Figure 1b. From left: our method; PC; LFR; FDDEI; LSS; MI.

It should be noted that the proposed method performs better than MI. The underlying assumption of MI is that the statistical relationship between the matched images is homogeneous over the whole image domain. This normally holds when the intensity mapping between the images is global and highly correlated, or when structures with different intensities in one image have similar intensities in the other image, e.g. bone and background in CT and MR. However, as discussed above, the statistical relationships between the intensities of multi-modal image pairs are normally neither global nor homogeneous, which is quite different from the situation in medical images. Therefore, MI may not be sufficient for matching multi-modal images. In addition, the absence of local spatial information in MI also weakens matching robustness to some extent.

Symmetries are a potentially robust and stable feature of many man-made and natural scenes, which makes them suitable for representing multi-modal images. LSS, which is designed to score local symmetries, therefore works reasonably well in our experiments, with performance almost comparable to that of PC, although its primary goal is to extract local features from images of architectural scenes.

From the evaluation using synthetic and real images, we can draw the following conclusion: because it takes noise resistance, illumination adaptability, and common structure extraction and weighting into account, the proposed method achieves a higher accuracy rate and better matching confidence than the conventional methods on the test images used.

5. Conclusion

To achieve robust multi-modal image matching, we first present two image representations, FSPC and MLPA, based on the Log-Gabor wavelet transformation, and then design the CAS, which combines confidence and similarity by using the information of FSPC and MLPA to find the correspondence. The proposed method has three main merits: (1) both MLPA and FSPC are invariant to the non-homogeneous illumination (contrast, brightness) variation and contrast reversal that frequently occur between multi-modal images; (2) FSPC can effectively capture the common structural information of the scene while suppressing the non-common, sensor-dependent properties; (3) acting as the confidence factor, the structural information extracted by FSPC is softly given more weight by the CAS. In addition, the proposed method is threshold-free and can therefore retain as much image detail as possible to resist the influence of noise and scene distortions between images. Experiments using numerous real and synthetic images demonstrate that our method can match multi-modal images robustly. Through comparison experiments, we also demonstrate the advantage of our method over the conventional methods. In the future, we plan to introduce geometric transformations into our matching framework and to extend our method to image alignment.


  1. Conte G, Doherty P: Vision-based unmanned aerial vehicle navigation using Geo-referenced information. EURASIP J. Adv. Sig. Process. 2009, 2009: 1-18.

    Article  Google Scholar 

  2. Kalal Z, Mikolajczyk K, Matas J: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intel. 2010, 6(1):1-14.

    Google Scholar 

  3. Vandewalle P, Susstrunk S, Vetterli M: A frequency domain approach to registration of aliased images with application to super-resolution. EURASIP J. Adv. Sig. Process. 2006, 2006: 1-14.

    Article  Google Scholar 

  4. Brown M, Lowe D: “Unsupervised 3D object recognition and reconstruction in unordered datasets,” in proc. Int. Conf. 3-D digit. Imag. Model 2005, 56-63.

    Google Scholar 

  5. Yingzi D, Craig B, Zhi Z: “Scale invariant Gabor descriptor-based noncooperative iris recognition. EURASIP J. Adv. Sig. Process. 2010, 2010: 1-13.

    Google Scholar 

  6. Yang Y, Dong Sun P, Shuying H, Nini R: “Medical image fusion via an EffectiveWavelet-based approach. EURASIP J. Adv. Sig. Process. 2010, 2010: 1-13.

    Article  Google Scholar 

  7. Wong A, Orchard J: Robust multi-modal registration using local phase-coherence representations. J. Sign. Process. Syst. 2009, 54: 89-100. 10.1007/s11265-008-0202-x

    Article  Google Scholar 

  8. Yingdan WU, Yang MING: “A multi-sensor remote sensing image matching method based on SIFT operator and CRA similarity measure”. Proceedings of 2011 International Conference on Intelligence Science and Information Engineering 2011, 115-118.

    Google Scholar 

  9. Sasa W, Zhenbing Z, Ping Y, Zejing G: “Infrared and visible image matching algorithm based on NSCT and DAISY”. Proceedings of 2011 4th International Congress on Image and Signal Processing 2011, 4: 2072-2075.

    Google Scholar 

  10. Yong S, Jae H, Jong B: Multi-sensor image registration based on intensity and edge orientation information. Pattern Recognition 2008, 41: 3356-3365. 10.1016/j.patcog.2008.04.017

    Article  Google Scholar 

  11. Inglada J: “Similarity measures for multi-sensor remote sensing images”. Proceedings of Geoscience and Remote Sensing Symposium, Toulouse 2001, 5236: 182-189.

    Google Scholar 

  12. Kovesi P: “Image correlation from local frequency information”. Proceedings of the Australian Pattern Recognition Society Conference 1995, 1995: 336-341.

    Google Scholar 

  13. Irani PAM: “Robust multi-sensor image alignment”. Proceedings of the 6th International Conference on Computer Vision 1998, 959-966.

    Google Scholar 

  14. Josien PW, Pluim JB, Antoine M, Viergever MA: “Mutual-information-based registration of medical images: a survey. IEEE Trans. Med. Imag. 2003, 22(8):986-1004. 10.1109/TMI.2003.815867

    Article  Google Scholar 

  15. Estévez PA, Tesmer M, Perez CA, Zurada JM: Normalized mutual information feature selection. IEEE Trans. Neural Netw. 2009, 20(2):189-201.

    Article  Google Scholar 

  16. Liu J, Vemuri BC, Bova F: Efficient multi-modal image registration using local-frequency maps. Mach. Vis. Appl. 2002, 13: 149-163. 10.1007/s001380100072

    Article  Google Scholar 

  17. Morlet J, Arens G, Fourgeau E, Giard D: Wave propagation and sampling theory-part II: sampling theory and complex waves. Geophysics 1982, 47(2):222-236. 10.1190/1.1441329

    Article  Google Scholar 

  18. Morrone MC, Ross JR, Burr DC, Owens RA: Mach bands are phase dependent. Nature 1986, 324(6094):250-253. 10.1038/324250a0

    Article  Google Scholar 

  19. Morrone MC, Owens RA: Feature detection from local energy. Pattern. Recognit. Lett. 1987, 6: 303-313. 10.1016/0167-8655(87)90013-4

    Article  Google Scholar 

  20. Kovesi P: Phase congruency: a low-level image invariant. Psychol. Res. 2000, 64: 136-148. 10.1007/s004260000024

    Article  Google Scholar 

  21. Elbakary MI, Sundareshan MK: Multi-modal image registration using local frequency representation and computer-aided design (CAD) models. Image Vis. Comput. 2007, 25: 663-670. 10.1016/j.imavis.2006.05.009

    Article  Google Scholar 

  22. Zheng L, Robert L`r: “Phase congruence measurement for image similarity assessment”. Pattern. Recognit. Lett. 2007, 28: 166-172. 10.1016/j.patrec.2006.06.019

    Article  Google Scholar 

  23. Daniel Cabrini H, Noah S: “Image matching using local symmetry features”. Proc. CVPR 2012, 206-213.

    Google Scholar 

  24. Viola P, Wells WM: “Alignment by maximization of mutual information”. Proc. ICCV 1995, 16-23.

    Google Scholar 

Download references


This work was partly supported by Oulu University, Finland. The authors would like to thank Prof. Janne Heikkila, Dr. Jie Chen and Guoying Zhao for their contributions. The authors also want to express their gratitude to the anonymous reviewers whose thoughtful comments and suggestions improved the quality of the article.

Author information


Corresponding author

Correspondence to Xiaochun Liu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Liu, X., Lei, Z., Yu, Q. et al. Multi-modal image matching based on local frequency information. EURASIP J. Adv. Signal Process. 2013, 3 (2013).
