Attentional Mechanisms for Interactive Image Exploration

A lot of work has been devoted to content-based image retrieval from large image databases. The traditional approaches are based on the analysis of the whole image content both in terms of low-level and semantic characteristics. We investigate in this paper an approach based on attentional mechanisms and active vision. We describe a visual architecture that combines bottom-up and top-down approaches for identifying regions of interest according to a given goal. We show that a coarse description of the searched target combined with a bottom-up saliency map provides an efficient way to find specified targets on images. The proposed system is a first step towards the development of software agents able to search for image content in image databases.


This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

INTRODUCTION
Image analysis is confronted with the development of large image databases, and new techniques have to be designed for image and content retrieval in this context. The agent paradigm has proved its efficiency for searching in unstructured databases. An agent exhibits interaction abilities with its environment and an autonomous behavior driven by its perceptions of the environment and its expectancies. This viewpoint emphasizes the role of interaction in visual processing and is related to the active vision paradigm mainly used in robotics [1,2]. We propose here to use a similar paradigm of active vision for implementing content retrieval mechanisms in fixed images or video sequences. To drive the active vision system, we need a mechanism for identifying salient regions in the visual scene. Most of the systems proposed for the computation of saliency maps are based on bottom-up approaches [3,4]. We use here a bottom-up mechanism to identify a first set of salient regions and a top-down mechanism for target recognition. Salient regions can be defined as high-energy contrast regions. On the other hand, regions of interest are characterized by their high relevance according to a given goal. Preattentional mechanisms are based on saliencies while attentional top-down processes are goal-directed. We thus propose an approach that combines both mechanisms in the following way.
We distinguish two nested regions in an image: the whole visual field, a low-resolution area that can be shifted by attention from position to position, and a small central foveal region that can be analyzed at full resolution. A first set of points is computed at low resolution from the whole visual field and used to give the focus to each potentially interesting region one at a time. We study here how information on the target can bias this exploratory step and improve its efficiency. We also compare different approaches for identifying or rejecting the target when it is foveated.
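The two nested regions can be illustrated with a minimal sketch. It assumes the image is a list of rows of grey levels and that low resolution is obtained by block averaging. The function names, window sizes, and the subsampling scheme are hypothetical choices made for illustration and are not taken from the system described here.

```python
def crop(image, cx, cy, half):
    """Square window of side 2*half+1 centred on (cx, cy), clipped to the image."""
    h, w = len(image), len(image[0])
    rows = range(max(0, cy - half), min(h, cy + half + 1))
    cols = range(max(0, cx - half), min(w, cx + half + 1))
    return [[image[r][c] for c in cols] for r in rows]


def visual_field(image, cx, cy, half=32, step=4):
    """Large window around the gaze position, reduced by averaging step x step blocks.

    Block averaging is an assumed stand-in for the low-resolution analysis.
    """
    window = crop(image, cx, cy, half)
    low_res = []
    for r in range(0, len(window) - step + 1, step):
        row = []
        for c in range(0, len(window[0]) - step + 1, step):
            block = [window[r + i][c + j] for i in range(step) for j in range(step)]
            row.append(sum(block) / len(block))
        low_res.append(row)
    return low_res


def fovea(image, cx, cy, half=4):
    """Small central window kept at full resolution."""
    return crop(image, cx, cy, half)
```

Shifting attention then amounts to calling both functions again with a new centre (cx, cy).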

Definition of a saliency space
The first step in our work consisted in defining a projection space in which we can compute the saliencies present in the visual field. Although saliencies can be computed by various methods (e.g., local image contrast), we assumed here that saliencies are mainly based on preferred orientations and spatial frequencies. Consequently, we used an approach based on Gabor wavelets. The convolution of the image by a bidimensional Gabor wavelet can be described by the equation
$$ r(x, \Omega_{k,\theta}) = \int I(x')\, e^{-\frac{1}{2}(x - x')^{\tau} \Sigma^{-1} (x - x')}\, e^{-i\Omega_{k,\theta}(x - x')}\, dx' $$
where $I(x)$ is the initial image, $r(x, \Omega_{k,\theta})$ the filtered image, and $e^{-\frac{1}{2}x^{\tau}\Sigma^{-1}x}\, e^{-i\Omega_{k,\theta}x}$ is the Gabor convolution kernel. $\Omega_{k,\theta}$ is a row vector defining the preferred orientations of the filter such that $\Omega_{k,\theta} = \Omega_k R_\theta$, where $R_\theta$ is the rotation matrix defining the orientation of the filter and $\Omega_k = (\omega_k \;\; 0)$ the central frequency of the filter.
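As an illustration of the kernel defined above, the following sketch samples a Gabor kernel on a discrete grid and evaluates the filter response at one pixel. It assumes an isotropic envelope (Σ = σ²I), a counter-clockwise rotation matrix R_θ = [[cos θ, −sin θ], [sin θ, cos θ]], and a truncated square support. None of these choices is specified in the text, and the function names are hypothetical.

```python
import cmath
import math


def gabor_kernel(omega, theta, sigma, half):
    """Sample g(x) = exp(-|x|^2 / (2 sigma^2)) * exp(-i Omega.x) on a (2*half+1)^2 grid."""
    # Omega_{k,theta} = (omega, 0) R_theta, with the assumed rotation convention.
    wx = omega * math.cos(theta)
    wy = -omega * math.sin(theta)
    kernel = {}
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            envelope = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            carrier = cmath.exp(-1j * (wx * dx + wy * dy))
            kernel[(dx, dy)] = envelope * carrier
    return kernel


def response(image, x, y, kernel):
    """Discrete convolution r(x) = sum over x' of I(x') g(x - x'), ignoring pixels outside the image."""
    h, w = len(image), len(image[0])
    total = 0j
    for (dx, dy), g in kernel.items():
        px, py = x - dx, y - dy
        if 0 <= px < w and 0 <= py < h:
            total += image[py][px] * g
    return total
```

On a grating that varies only along the horizontal axis at frequency omega, the modulus of the response is much larger for theta = 0 than for theta = π/2, which is the orientation selectivity the saliency space relies on.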
From a statistically significant set of natural images analyzed through a Gabor wavelet bank, we extracted small image patches at random. Each patch had the same size as the foveal region. From each patch, we computed as many signature vectors $v_k = \{r_{k,\theta}\}_{\theta \in \{0, \pi/4, \pi/2, 3\pi/4\}}$ as the number of desired frequency bands according to the following equation: where $N$ is the number of pixels in the image patch and $r^{*}(x, \Omega_{k,\theta})$ and $r(x, \Omega_{k,\theta})$ are complex conjugates. The multiresolution technique used to compute the $v_k$ vector is similar to the one proposed by [5]. A principal component analysis (PCA) was then applied to each of these vectors for each spatial frequency channel according to $z = U^{\tau} v$, where $U$ is an orthogonal projection matrix such that the covariance $\langle zz^{\tau} \rangle$ is diagonal.
We thus obtained four projection axes in each frequency band, the components of which are linear combinations of the initial orientations. The resulting projection space reflects the second-order statistical regularities observed in the subset of natural images used. However, experiments performed with various subsets did not show significant differences.
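The PCA step can be sketched as follows for one frequency band, where each signature vector has four components (one per orientation). This is only an illustration: it centres the data before computing the covariance and extracts only the first axis by power iteration, a technique swapped in here to stay within the standard library. The text does not state how the decomposition was actually computed, and all function names are hypothetical.

```python
import math


def covariance(vectors):
    """Mean vector and covariance matrix of a list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    centred = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in range(dim)]
           for i in range(dim)]
    return mean, cov


def first_principal_axis(cov, iterations=200):
    """Dominant eigenvector of the covariance matrix, found by power iteration."""
    dim = len(cov)
    axis = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * axis[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            break
        axis = [p / norm for p in product]
    return axis


def project(v, mean, axis):
    """Coordinate of a signature vector along the given axis."""
    return sum((v[i] - mean[i]) * axis[i] for i in range(len(v)))
```

Projecting the signature vector of a patch onto the first axis gives a single scalar per position, which can then be read as a saliency value for that axis.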

Preattentional and attentional controls
The saliencies of the scene at each position in the visual field can then be obtained as the projection of the $v_k$ vectors on the corresponding axis of the PCA (Figure 1). We have shown elsewhere that the salient points computed by this method differ according to the considered axis [6]. Here only the first eigenvector at low resolution was used.
The obtained salient points are used to control the exploration of the scene. In the present study, two methods were used: the bottom-up control uses only information extracted from the visual scene in a preattentional way, while the top-down control implements an attentional mechanism driven by previously memorized information concerning the target.
We tested this architecture on a task where the system's behavior is to find targets similar to the one pointed out by the user.
In bottom-up control mode, when the user points to a region, the system finds the nearest salient point in its present visual field, focuses on it, and then computes the low-resolution bottom-up salient points in its new visual field. It then focuses on the most salient of these points and computes a recognition score of the target. Two kinds of scores have been tested: (i) one from the average of the Gabor norms, (ii) the other being simply the concatenation of the Gabor norm image vectors covering the foveal area of the system. In this study, these vectors are of dimension 12 (3 spatial frequencies, 4 orientations).
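The two selection steps of the bottom-up mode (nearest salient point to the user's click, then visiting points by decreasing saliency) can be sketched as below. The representation of a salient point as an (x, y, saliency) tuple and the function names are assumptions made for this example.

```python
import math


def nearest_salient_point(points, clicked):
    """Salient point closest (Euclidean distance) to the position pointed out by the user."""
    return min(points, key=lambda p: math.hypot(p[0] - clicked[0], p[1] - clicked[1]))


def exploration_order(points):
    """Salient points sorted so that the most salient one is visited first."""
    return sorted(points, key=lambda p: p[2], reverse=True)
```

For instance, with points (10, 10, 0.2), (40, 12, 0.9), and (15, 30, 0.5), a click at (12, 14) selects the first point, and the exploration then starts at (40, 12), the most salient one.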
In top-down control mode, the system performs a low-resolution comparison between the salient points in its whole visual field and a low-resolution signature of the target. It thus retains only the salient points whose comparison score exceeds a given threshold. This mechanism leads to a modulation of the natural saliency of the considered point according to the low-resolution characteristics of the searched target. Two kinds of score computations were tested: (i) a comparison of the energy vectors computed from the low-resolution part of the multiresolution analysis for the salient point $x_s$ and for the target representation $x_t$ (TDE), and (ii) a direct comparison of the low-frequency images of the salient region and of the target (TDV). The similarity score is then computed using a radial basis function, $a = e^{-\|x_s - x_t\|^2 / 2\sigma^2}$.
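A minimal sketch of the top-down selection follows. It computes the radial basis function score between each salient point's low-resolution signature and the target signature, then keeps the points above a threshold. The data layout (a list of (position, signature) pairs), the ordering of the retained points by decreasing score, and the values of sigma and the threshold are assumptions. The same function applies whether the signature is an energy vector (TDE) or a flattened low-frequency image (TDV).

```python
import math


def rbf_similarity(x_s, x_t, sigma):
    """a = exp(-||x_s - x_t||^2 / (2 sigma^2)), equal to 1 for identical signatures."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_s, x_t))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def top_down_filter(salient_points, target, sigma, threshold):
    """Keep the salient points whose similarity to the target exceeds the threshold.

    salient_points: list of (position, signature) pairs (assumed layout).
    Returns (position, score) pairs, highest score first (assumed ordering).
    """
    scored = [(pos, rbf_similarity(sig, target, sigma)) for pos, sig in salient_points]
    kept = [(pos, score) for pos, score in scored if score > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because the score decreases exponentially with the squared distance, sigma sets how coarse the match to the target is allowed to be, and the threshold then controls how many points remain to be foveated.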

Discussion of the model properties
Some points concerning this approach deserve to be discussed before the description of the obtained results. Major results have been obtained during the last decade concerning the first steps of visual processing in natural systems [7,8,9]. These papers show that the first filtering steps consist in the elaboration of an optimal code based on the maximization of a statistical independence criterion. This leads to filters similar to those obtained using independent component analysis (ICA) [10,11,12], which have been shown to be very similar to Gabor filters [13]. This is why we use this approach in our model.
However, it is interesting to analyze the kind of salient features obtained from the computations described above. Experiments with several different images demonstrated that the features emphasized by such projections mainly consist of termination and curvature points. For instance, some of the features extracted from a test image according to the first PCA axis are rotation-invariant curvature points (Figure 2).
Because these features behave as end-stopping detectors, it is interesting to observe that the salient positions computed from the image in Figure 3 can be invoked as an explanation for the Müller-Lyer illusion.

RESULTS
Although the system can be used in various object search tasks, we only present here the results obtained in a face retrieval task.
The user points to a face in a scene and the task of the system is to find similar patterns across the image. On this task, we tested the three methods presented above (bottom-up, top-down energy (TDE), and top-down vector (TDV), see Figure 4).
In bottom-up mode, the system is driven by the natural saliencies computed from the scene. These saliencies are sorted by decreasing intensity, so that the system begins its exploration with the highest-intensity saliency. The similarity score obtained in this case ranges from 0.1 to 1.0. Ten percent of the points have a similarity score in the range 0.9-1.0, while 17% are in the range 0.8-0.9. The majority of the points have a score in the range 0.6-0.9.
In the top-down mode, the system is guided by high-level information. In TDE mode, the similarity scores range from 0.3 to 1.0. Fourteen percent of the points lie between 0.9 and 1.0, while 22% range from 0.8 to 0.9. Most of the visited points have a similarity score between 0.7 and 1.0.
In TDV mode, there is a decrease in the variability of the similarity score. Sixty-five percent of the points have a similarity score in the range 0.9-1.0 and 10% between 0.8 and 0.9. Most of the visited points lie between 0.9 and 1.0. The use of top-down information leads to a significant reduction in the number of visited points (234 for the bottom-up exploration, 107 for TDE, and 31 for TDV for the example in Figure 4).
When this experiment is repeated with various images (up to 20 images), faces always yield similarity scores greater than 0.8. We retained this value as a decision threshold separating face and nonface locations. We were thus able to compute an error rate for the different experiments from a comparison between the answer of the system (a similarity score greater than 0.8 being now considered as a positive answer) and the ground truth of the target.
It results from these investigations that in the bottom-up mode only 27% of the visited points are faces, while this percentage increases to 36% in the TDE mode and reaches 74% in the TDV mode (Figure 5). On the other hand, in the bottom-up mode the error rate is 47%. It decreases to 26% and 30% in TDE and TDV, respectively. The TDV method gives rise to the best results.
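The two quantities reported above can be computed as in the following sketch, which applies the 0.8 decision threshold to a list of visited points. It assumes that the error rate counts every disagreement between the thresholded answer and the ground truth (false positives and false negatives alike), which the text does not spell out. The function name and the sample data are invented for illustration and are not results of the study.

```python
def evaluate(visited, threshold=0.8):
    """Face ratio and error rate over visited points.

    visited: list of (similarity_score, is_face) pairs, is_face being the ground truth.
    A score strictly greater than the threshold counts as a positive answer.
    """
    answers = [score > threshold for score, _ in visited]
    truths = [is_face for _, is_face in visited]
    errors = sum(1 for a, t in zip(answers, truths) if a != t)
    return {
        "face_ratio": sum(truths) / len(visited),
        "error_rate": errors / len(visited),
    }


# Made-up example: two faces and two nonfaces, one of each misclassified.
sample = [(0.95, True), (0.85, False), (0.60, False), (0.70, True)]
result = evaluate(sample)
```

With this made-up sample, half of the visited points are faces and half of the answers are wrong, so both values equal 0.5.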
One mandatory specification of this kind of system is its robustness to variations of illumination. We tested the behavior of the system when searching for identical targets in a series of video images. This property is indeed especially important when we want to follow the same object through a video sequence. We used the TDV mode to search for a zone pointed out by the user in a mid-illuminated scene (image mean intensity 151.9, expressed in grey levels) through a set of homologous images whose illumination ranges from 69.24 to 185.69. Figure 6a shows the variation of the similarity score according to the illumination for homologous points (i.e., points corresponding to the same target, in order to detect false negatives).
Figure 6b shows the same result for heterologous points (i.e., points corresponding to different targets, in order to detect false positives). The mean score remains approximately constant as a function of illumination. Its variance increases with illumination, but the discrimination ability of the system (measured by the threshold between the two curves) is preserved.

DISCUSSION AND CONCLUSION
The system presented in this paper is based on two principles: (i) the selection of salient points used to guide exploratory saccades, (ii) the combination of bottom-up and top-down information to bias the saliencies in favor of the searched target. This last modulation reduces the computational load of the system. The identification of the salient points is indeed not based on a saliency map computed on the whole scene [3,14,15] but limited to the visual field and computed at low resolution. The list of the potentially interesting coordinate points can thus be viewed as a sparse representation of the scene consisting of a system of references to the external location where the complete information lies. Such a view was first introduced by O'Regan, who proposes to see the world as an external memory [16]. It implements the first principles of the sensorimotor theory of perception proposed by this author [17]. This mechanism is also related to the notion of deictic pointers proposed by Ballard [18]. Note that only stable landmarks can be used for this purpose and that new questions could arise in the case of video applications.
The proposed architecture can be applied to any search and exploration task. It is indeed independent of the type and size of the image and of the searched target.
Our final goal is to build an exploratory vision architecture able to work in real time. The reduction of the computational load is critical to achieve this goal. This constraint explains the limited number of preferred directions used in the computation of saliencies and the relative simplicity of the coding method.
The multiresolution technique used here, which performs the complex processing steps on previously selected regions, also provides a mechanism to overcome the real-time constraints. Though the retained information does not allow a complete reconstruction of the initial scene, it is sufficient to ensure a fast enough exploration mechanism. The advantage of this approach, which distinguishes low-resolution and large-field processing from high-resolution focused computations, is twofold. It indeed reduces the need for complex computation during the exploration process and, perhaps more importantly, clearly separates the exploration and exploitation steps that constitute the behavior of the system. As suggested by psychophysics experiments [19], we make the hypothesis that the identification processes happening in peripheral and central vision are quite different. In peripheral vision, we do not need to cope with invariance, since the available representation is simplified, partial, and sparse. It is only made of a set of pointers useful for driving action. From these regions, it seems to be impossible to get a complex representation of objects [19]. On the contrary, the central part of the visual field provides the information for building complex object representations. One of the major contributions of the proposed approach is that the system does not need a complete representation of the object to select the locations to focus on. The recognition process can thus take place in two steps: (i) identification of potentially interesting locations according to the searched target, (ii) recognition of the target after foveation. When the search process is biased by low-resolution information related to the target, the number of potentially interesting points dramatically decreases, which improves the efficiency of the search process. We can thus parallel this mechanism with the one at work in natural vision systems, in which the search for a given target could be driven by a simplified description of the target, the recognition process being made easier by the fact that it operates only on focused regions.
One can argue that the proposed method is neither rotation- nor scale-invariant. However, it is inherently invariant to translation: since the targets will eventually be centered, the translational invariance problem disappears.
Another interesting consequence of considering perception as a dynamical mechanism is that a system endowed with these perceptual abilities can be viewed as a kind of autonomous agent. The interactive process in which the agent is involved can thus be improved using learning techniques.

Figure 1: An overall presentation of the attentional model. A bottom-up saliency map is biased with the information on the desired target lying in long-term memory.

Figure 2: Bottom-up detection of interest points. The figure illustrates the end-stopping (termination detector) properties of the approach.

Figure 3: Bottom-up detection of interest points. (a) Detection of interest points is made on the basis of curvature and termination characteristics. (b) The energy peak from these detectors is located inside the direct arrowheads and outside the reversed arrowheads, as expected in the Müller-Lyer illusion where the direct arrowheads appear shorter than the reverse arrowheads.

Figure 4: Percent of visited points according to the similarity score. The figure shows that a large portion of visited points have a low similarity score in bottom-up exploration, while in TDE and in TDV the visited points exhibit greater similarity scores. The image shows the result obtained with the face recognition task in TDV mode.

Figure 5: (a) Evolution of the number of points explored by the system in the three investigated modes. (b) Evolution of the ratio between faces and nonfaces in the visited points (upper values) and evolution of the recognition error rate (lower values).

Figure 6: Robustness to the variations of illumination. A video sequence with a continuous variation in luminance has been used to follow the detection of homologous interest points from image to image. The figure shows the mean detection score for (a) targets and (b) nontargets superimposed with the luminance curve (expressed in grey levels).