EURASIP Journal on Applied Signal Processing 2003:8, 841–859 © 2003 Hindawi Publishing Corporation

A Domain-Independent Window Approach to Multiclass Object Detection Using Genetic Programming

This paper describes a domain-independent approach to the use of genetic programming for object detection problems in which the locations of small objects of multiple classes in large images must be found. The evolved program is scanned over the large images to locate the objects of interest. The paper develops three terminal sets based on domain-independent pixel statistics and considers two different function sets. The fitness function is based on the detection rate and the false alarm rate. We have tested the method on three object detection problems of increasing difficulty. This work not only extends genetic programming to multiclass object detection problems, but also shows how a single evolved genetic program can be used for both object classification and localisation. The object classification map developed in this approach can be used as a general classification strategy in genetic programming for multiple-class classification problems.


INTRODUCTION
As more and more images are captured in electronic form, the need for programs which can find objects of interest in a database of images is increasing. For example, it may be necessary to find all tumors in a database of x-ray images, all cyclones in a database of satellite images, or a particular face in a database of photographs. The common characteristic of such problems can be phrased as "given subimage_1, subimage_2, ..., subimage_n, which are examples of the objects of interest, find all images which contain this object and its location(s)." Figure 10 shows examples of problems of this kind. In the problem illustrated by Figure 10b, we want to find the centres of all of the Australian 5-cent and 20-cent coins and determine whether the head or the tail side is up. Examples of other problems of this kind include target detection problems [1, 2, 3], where the task is to find, say, all tanks, trucks, or helicopters in an image. Unlike most of the current work in the object recognition area, where the task is to detect only objects of one class [1, 4, 5], our objective is to detect objects from a number of classes.
Domain independence means that the same method will work unchanged on any problem, or at least on some range of problems. This is very difficult to achieve at the current state of the art in computer vision because most systems require careful analysis of the objects of interest and a determination of which features are likely to be useful for the detection task. Programs for extracting these features must then be coded or found in some feature library. Each new vision system must be handcrafted in this way. Our approach is to work from the raw pixels directly or to use easily computed pixel statistics such as the mean and variance of the pixels in a subimage and to evolve the programs needed for object detection.
Several approaches have been applied to automatic object detection and recognition problems. Typically, they use multiple independent stages, such as preprocessing, edge detection, segmentation, feature extraction, and object classification [6, 7], which often causes efficiency and effectiveness problems: the final result relies heavily on the results of the earlier stages. If some objects are lost in one of the early stages, it is very difficult or impossible to recover them in the later stages. To avoid these disadvantages, this paper introduces a single-stage approach.
There have been a number of reports on the use of genetic programming (GP) in object detection and classification [8,9]. Winkeler and Manjunath [10] describe a GP system for object detection in which the evolved functions operate directly on the pixel values. Teller and Veloso [11] describe a GP system and a face recognition application in which the evolved programs have a local indexed memory. All of these approaches are based on detecting one class of objects or two-class classification problems, that is, objects versus everything else. GP naturally lends itself to binary problems as a program output of less than 0 can be interpreted as one class and greater than or equal to 0 as the other class. It is not obvious how to use GP for more than two classes. The approach in this paper will focus on object detection problems in which a number of objects in more than two classes of interest need to be localised and classified.

Outline of the approach to object detection
A brief outline of the method is as follows.
(1) Assemble a database of images in which the locations and classes of all of the objects of interest are manually determined. Split these images into a training set and a test set.
(2) Determine an appropriate size (n × n) of a square which will cover all single objects of interest to form the input field.
(3) Invoke an evolutionary process with images in the training set to generate a program which can determine the class of an object in its input field.
(4) Apply the generated program as a moving window template to the images in the test set and obtain the locations of all the objects of interest in each class. Calculate the detection rate (DR) and the false alarm rate (FAR) on the test set as the measure of performance.
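The moving-window sweep of step (4) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: `program` and `classify` are hypothetical stand-ins for the evolved detector and the classification map introduced later in the paper, the toy image is invented, and the consolidation of overlapping detections into single object centres is omitted.

```python
def sweep(image, n, program, classify):
    """Slide an n x n window over `image` and label each position."""
    h, w = len(image), len(image[0])
    detections = []
    for r in range(h - n + 1):
        for c in range(w - n + 1):
            window = [row[c:c + n] for row in image[r:r + n]]
            v = program(window)       # floating-point program output
            label = classify(v)       # class index, or None for background
            if label is not None:
                # report the window centre as a candidate object location
                detections.append((r + n // 2, c + n // 2, label))
    return detections

# Toy example: a "program" that returns the window mean, and a map that
# reports class 1 whenever the mean exceeds 100.
img = [[0] * 6 for _ in range(6)]
for r in range(2, 5):
    for c in range(2, 5):
        img[r][c] = 255
hits = sweep(img, 3,
             lambda w: sum(sum(row) for row in w) / 9.0,
             lambda v: 1 if v > 100 else None)
```

The window fully covering the bright 3 × 3 patch reports its centre, (3, 3), as a class-1 detection; nearby overlapping windows also fire, which is why step (2) of the fitness procedure later consolidates detections into object centres.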

Goals
The overall goal of this paper is to investigate a learning/adaptive, single-stage, and domain-independent approach to multiple-class object detection problems without any preprocessing, segmentation, and specific feature extraction. This approach is based on a GP technique. Rather than using specific image features, pixel statistics are used as inputs to the evolved programs. Specifically, the following questions will be explored on a sequence of detection problems of increasing difficulty to determine the strengths and limitations of the method.
(i) What image features involving pixels and pixel statistics would make useful terminals?
(ii) Will the 4 standard arithmetic operators be sufficient for the function set?
(iii) How can the fitness function be constructed, given that there are multiple classes of interest?
(iv) How will performance vary with increasing difficulty of image detection problems?
(v) Will the performance be better than a neural network (NN) approach [12] on the same problems?

Structure
The remainder of this paper gives a brief literature survey, then describes the main components of this approach including the terminal set, the function set, and the fitness function. After describing the three image databases used here, we present the experimental results and compare them with an NN method. Finally, we analyse the results and the evolved programs and present our conclusions.

Object detection
The term object detection here refers to the detection of small objects in large images. This includes both object classification and object localisation. Object classification refers to the task of discriminating between images of different kinds of objects, where each image contains only one of the objects of interest. Object localisation refers to the task of identifying the positions of all objects of interest in a large image. The object detection problem is similar to the commonly used terms automatic target recognition and automatic object recognition. We classify the existing object detection systems into three dimensions based on whether the approach is segmentation free or not, domain independent or specific, and on the number of object classes of interest in an image.

Segmentation-based versus single stage
According to the number of independent stages used in the detection procedure, we divide the detection methods into two categories.
(i) Segmentation-based approach, which uses multiple independent stages for object detection. Most research on object detection involves 4 stages: preprocessing, segmentation, feature extraction, and classification [13, 14, 15], as shown in Figure 1. The preprocessing stage aims to remove noise or enhance edges. In the segmentation stage, a number of coherent regions and "suspicious" regions which might contain objects are usually located and separated from the entire images. The feature extraction stage extracts domain-specific features from the segmented regions. Finally, the classification stage uses these features to distinguish the classes of the objects of interest. The algorithms or methods for these stages are generally domain specific. Learning paradigms, such as NNs and genetic algorithms/programming, have usually been applied to the classification stage. In general, each independent stage needs a program to fulfill that specific task and, accordingly, multiple programs are needed for object detection problems. Success at each stage is critical to achieving good final detection performance. Detection of trucks and tanks in visible, multispectral infrared, and synthetic aperture radar images [2], and recognition of tanks in cluttered images [6] are two examples.

Figure 1: A typical procedure for object detection.
(ii) Single-stage approach, which uses only a single stage to detect the objects of interest in large images. There is only a single program produced for the whole object detection procedure. The major property of this approach is that it is segmentation free. Detecting tanks in infrared images [3] and detecting small targets in cluttered images [16] based on a single NN are examples of this approach.
While most recent work on object detection problems concentrates on the segmentation-based approach, this paper focuses on the single-stage approach.

Domain-specific approach versus domain-independent approach
In terms of the generalisation of the detection systems, there are two major approaches.
(i) Domain-specific object detection, which uses specific image features as inputs to the detector or classifier. These features, which are usually highly domain dependent, are extracted from entire images or segmented images. In a lentil grading and quality assessment system [17], for example, features such as brightness, colour, size, and perimeter are extracted and used as inputs to an NN classifier. This approach generally involves a time-consuming investigation of good features for a specific problem and a handcrafting of the corresponding feature extraction programs.
(ii) Domain-independent object detection, which usually uses the raw pixels directly (no features) as inputs to the detector or classifier. In this case, feature selection, extraction, and the handcrafting of corresponding programs can be completely removed. This approach usually needs learning and adaptive techniques to learn features for the detection task. Directly using raw image pixel data as input to NNs for detecting vehicles (tanks, trucks, cars, etc.) in infrared images [1] is such an example. However, long learning/evolution times are usually required due to the large number of pixels. Furthermore, the approach generally requires a large number of training examples [18]. A special case is to use a small number of domain-independent, pixel level features (referred to as pixel statistics) such as the mean and variance of some portions of an image [19].

Multiple class versus single class
Regarding the number of object classes of interest in an image, there are two main types of detection problems.
(i) One-class object detection problem, where there are multiple objects in each image but they all belong to a single class. One special case in this category is that there is only one object of interest in each source image. In essence, these problems contain a binary classification problem: object versus nonobject, also called object versus background. Examples are detecting small targets in thermal infrared images [16] and detecting a particular face in photograph images [20].
(ii) Multiple-class object detection problem, where there are multiple object classes of interest, each of which has multiple objects in each image. Detection of handwritten digits in zip code images [21] is an example of this kind.
It is possible to view a multiclass problem as a series of binary problems. A problem with objects of three classes of interest can be implemented as class 1 against everything else, class 2 against everything else, and class 3 against everything else. However, these are not independent detectors, as some method must be provided for dealing with situations in which two detectors report an object at the same location.
In general, multiple-class object detection problems are more difficult than one-class detection problems. This paper is focused on detecting multiple objects from a number of classes in a set of images, which is particularly difficult. Most research in object detection which has been done so far belongs to the one-class object detection problem.

Performance evaluation
In this paper, we use the DR and FAR to measure the performance of multiclass object detection problems. The DR refers to the number of small objects correctly reported by a detection system as a percentage of the total number of actual objects in the image(s). The FAR, also called false alarms per object or false alarms/object [16], refers to the number of nonobjects incorrectly reported as objects by a detection system as a percentage of the total number of actual objects in the image(s). Note that the DR is between 0 and 100%, while the FAR may be greater than 100% for difficult object detection problems.
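The DR and FAR definitions above can be written down directly. This small sketch uses invented numbers purely to illustrate the definitions, including the point that the FAR, being expressed per actual object, can exceed 100%.

```python
def detection_rate(n_correct, n_true):
    """Correctly reported objects as a percentage of actual objects."""
    return 100.0 * n_correct / n_true

def false_alarm_rate(n_false, n_true):
    """False alarms per actual object, as a percentage (may exceed 100%)."""
    return 100.0 * n_false / n_true

# Illustrative figures: 8 of 10 objects found, with 25 false alarms.
dr = detection_rate(8, 10)       # 80.0
far = false_alarm_rate(25, 10)   # 250.0
```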
The main goal of object detection is to obtain a high DR and a low FAR. There is, however, a trade-off between them for a detection system. Trying to improve the DR often results in an increase in the FAR, and vice versa. Detecting objects in images with very cluttered backgrounds is an extremely difficult problem where FARs of 200-2000% (i.e., the detection system suggests that there are 20 times as many objects as there really are) are common [5,16].
Most research which has been done in this area so far only presents the results of the classification stage (only the final stage in Figure 1) and assumes that all other stages have been properly done. However, the results presented in this paper are the performance for the whole detection problem (both the localisation and the classification).

Related work-GP for object detection
Since the early 1990s, there has been only a small amount of work on applying GP techniques to object classification, object detection, and other vision problems. This, in part, reflects the fact that GP is a relatively young discipline compared with, say, NNs.

Object classification
Tackett [9,22] uses GP to assign detected image features to a target or nontarget category. Seven primitive image features and twenty statistical features are extracted and used as the terminal set. The 4 standard arithmetic operators and a logic function are used as the function set. The fitness function is based on the classification result. The approach was tested on US Army NVEOD Terrain Board imagery, where vehicles, such as tanks, need to be classified. The GP method outperformed both an NN classifier and a binary tree classifier on the same data, producing lower rates of false positives for the same DRs.
Andre [23] uses GP to evolve functions that traverse an image, calling upon coevolved detectors in the form of hitmiss matrices to guide the search. These hit-miss matrices are evolved with a two-dimensional genetic algorithm. These evolved functions are used to discriminate between two letters or to recognise single digits.
Koza in [24, Chapter 15] uses a "turtle" to walk over a bitmap landscape. This bitmap is to be classified either as a letter "L," a letter "I," or neither of them. The turtle has access to the values of the pixels in the bitmap by moving over them and calling a detector primitive. The turtle uses a decision tree process, in conjunction with negative primitives, to walk over the bitmap and decide which category a particular landscape falls into. Using automatically defined functions as local detectors and a constrained syntactic structure, some perfect scoring classification programs were found. Further experiments showed that detectors can be made for different sizes and positions of letters, although each detector has to be specialised to a given combination of these factors.
Teller and Veloso [11] use a GP method based on the PADO language to perform face recognition tasks on a database of face images in which the evolved programs have a local indexed memory. The approach was tested on a discrimination task between 5 classes of images [25] and achieved up to 60% correct classification for images without noise.
Robinson and McIlroy [26] apply GP techniques to the problem of eye location in grey-level face images. The input data from the images is restricted to a 3000-pixel block around the location of the eyes in the face image. This approach produced promising results over a very small training set, up to 100% true positive detection with no false positives, on a three-image training set. Over larger sets, the GP approach performed less well however, and could not match the performance of NN techniques.
Winkeler and Manjunath [10] produce genetic programs to locate faces in images. Face samples are cut out and scaled, then preprocessed for feature extraction. The statistics gleaned from these segments are used as terminals in GP, which evolves an expression returning how likely a pixel is to be part of a face image. Separate experiments process the grey-scale image directly, using low-level image processing primitives and scale-space filters.

Object detection
All of the reported GP-based object detection approaches belong to the one-class object detection category. In these detection problems, there is only one object class of interest in the large images.
Howard et al. [19] present a GP approach to automatic detection of ships in low-resolution synthetic aperture radar imagery. A number of random integer/real constants and pixel statistics are used as terminals. The 4 arithmetic operators and min and max operators constitute the function set. The fitness is based on the number of the true positive and false positive objects detected by the evolved program. A two-stage evolution strategy was used in this approach. In the first stage, GP evolved a detector that could correctly distinguish the target (ship) pixels from the nontarget (ocean) pixels. The best detector was then applied to the entire image and produced a number of false alarms. In the second stage, a brand new run of GP was tasked to discriminate between the clear targets and the false alarms as identified in the first stage and another detector was generated. This two-stage process resulted in two detectors that were then fused using the min function. These two detectors return a real number, which if greater than zero denotes a ship pixel, and if zero or less denotes an ocean pixel. The approach was tested on images chosen from commercial SAR imagery, a set of 50 m and 100 m resolution images of the English Channel taken by the European Remote Sensing satellite. One of the 100 m resolution images was used for training, two for validation, and two for testing. The training was quite successful with perfect DR and no false alarms, while there was only one false positive in each of the two test images and the two validation images which contained 22, 22, 48, and 41 true objects.
Isaka [27] uses GP to locate mouth corners in small (50 × 40) images taken from images of faces. Processing each pixel independently using an approach based on relative intensities of surrounding pixels, the GP approach was shown to perform comparably to a template matching approach on the same data.
A list of object detection related work based on GP is shown in Table 1.

The GP system
In this section, we describe our approach to a GP system for multiple-class object detection problems. Figure 2 shows an overview of this approach, which has a learning process and a testing procedure. In the learning/evolutionary process, the evolved genetic programs use a square input field which is large enough to contain each of the objects of interest. The programs are applied in a moving window fashion to the entire images in the training set to detect the objects of interest. In the test procedure, the best evolved genetic program obtained in the learning process is then applied to the entire images in the test set to measure object detection performance.
The learning/evolutionary process in our GP approach is summarised as follows.
(1) Generate an initial population of genetic programs.
(2) Repeat until a termination criterion is satisfied:
(2.1) Evaluate the individual programs in the current population and assign a fitness to each program.
(2.2) Until the new population is fully created, repeat the following: (i) select programs in the current generation; (ii) perform genetic operators on the selected programs; (iii) insert the result of the genetic operations into the new generation.
(3) Present the best individual in the population as the output: the learned/evolved genetic program.
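The control flow of this loop can be sketched in executable form. The sketch below is not the authors' system: it substitutes tournament-style selection and a trivial numeric "program" representation (a single number, with fitness its distance from a target) for the paper's proportional selection and tree-structured programs, purely to illustrate steps (1)-(3) and the smaller-is-better fitness convention.

```python
import random

def evolve(pop_size=20, max_generations=50, target=7.0, seed=0):
    rng = random.Random(seed)
    # step (1): generate an initial population
    population = [rng.uniform(-10, 10) for _ in range(pop_size)]
    def fit(p):
        return abs(p - target)              # smaller fitness is better
    for _ in range(max_generations):        # step (2)
        best = min(population, key=fit)     # step (2.1): evaluate
        if fit(best) < 1e-3:                # termination criterion
            break
        new_population = [best]             # elitist reproduction
        while len(new_population) < pop_size:   # step (2.2)
            # select: fitter of three randomly sampled programs
            a = min(rng.sample(population, 3), key=fit)
            b = min(rng.sample(population, 3), key=fit)
            child = (a + b) / 2.0           # stand-in for crossover
            child += rng.gauss(0, 0.3)      # stand-in for mutation
            new_population.append(child)
        population = new_population
    return min(population, key=fit)         # step (3): best individual

best_program = evolve()
```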
In this system, we used a tree-like program structure to represent genetic programs. The ramped half-and-half method was used for generating the programs in the initial population and for the mutation operator. The proportional selection mechanism and the reproduction, crossover, and mutation operators were used in the learning process.
In the remainder of this section, we address the other aspects of the learning/evolutionary system: (1) determination of the terminal set, (2) determination of the function set, (3) development of a classification strategy, (4) construction of the fitness measure, and (5) selection of the input parameters and determination of the termination strategy.

The terminal sets
For object detection problems, terminals generally correspond to image features. In our approach, we designed three different terminal sets: local rectilinear features, circular features, and "pixel features." In all these cases, the features are statistical properties of regions of the image, and we refer to them as pixel statistics.

Terminal set I-rectilinear features
In the first terminal set, twenty pixel statistics, F1 to F20 in Table 2, are extracted from the input field as shown in Figure 3. The input field must be sufficiently large to contain the biggest object and some background, yet small enough to include only a single object. In this way, the evolved program, acting as a detector, can mimic the human visual strategy of identifying pixels/object centres which stand out from their local surroundings.
In Figure 3, the grey-filled circle denotes an object of interest and the square A 1 B 1 C 1 D 1 represents the input field.

Figure 2: An overview of the GP approach for multiple-class object detection.

The five smaller squares represent local regions from which pixel statistics will be computed; the 4 central lines (rows and columns) are used for a similar purpose. 1 The mean and standard deviation of the pixels comprising each of these regions are used as two separate features. There are 6 regions, giving 12 features, F1 to F12. We also use the pixels along the main axes (4 lines) of the input field, giving features F13 to F20. In addition to these pixel statistics, we use a terminal which generates a random constant in the range [0, 255], corresponding to the range of pixel intensities in grey-level images.
These pixel statistics have the following characteristics.
(i) They are symmetrical.
(ii) Local regional features (from small squares and lines) are included. This assists the finding of object centres in the sweeping procedure: if the evolved program is considered as a moving window template, the match between the template and the subimage forming the input field will be better when the moving template is close to the centre of an object.
(iii) They are domain independent and easy to extract. These features belong to the pixel level and can be part of a domain-independent preexisting feature library of terminals from which the GP evolutionary process is expected to automatically learn and select only those relevant to a particular domain. This is quite different from traditional image processing and computer vision approaches, where problem-specific features are often needed.
(iv) The number of these features is fixed. In this approach, the number of features is always twenty no matter what size the input field is. This is particularly useful for the generalisation of the system implementation.

1 These lines can be considered special local regions. If the input field size n is an even number, each of these "lines" is a rectangle consisting of two rows or two columns of pixels.
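Terminal set I can be sketched as follows. The exact placement of the five smaller squares and the four lines follows one plausible reading of Figure 3 (four n/2 × n/2 corner squares plus a central one, and lines through the field centre); the figure itself defines the authoritative layout, so this layout is an assumption.

```python
def mean_sd(pixels):
    """Mean and (population) standard deviation of a pixel list."""
    m = sum(pixels) / len(pixels)
    sd = (sum((p - m) ** 2 for p in pixels) / len(pixels)) ** 0.5
    return m, sd

def region(img, r0, c0, h, w):
    """Pixels of the h x w region whose top-left corner is (r0, c0)."""
    return [img[r][c] for r in range(r0, r0 + h) for c in range(c0, c0 + w)]

def rectilinear_features(img):
    """Return F1..F20: mean and SD of 6 regions and 4 lines (assumed layout)."""
    n = len(img)
    h = n // 2
    groups = [
        region(img, 0, 0, n, n),                        # whole input field
        region(img, 0, 0, h, h),                        # top-left square
        region(img, 0, n - h, h, h),                    # top-right square
        region(img, n - h, 0, h, h),                    # bottom-left square
        region(img, n - h, n - h, h, h),                # bottom-right square
        region(img, (n - h) // 2, (n - h) // 2, h, h),  # central square
        region(img, h, 0, 1, n),                        # central row
        region(img, 0, h, n, 1),                        # central column
        region(img, h, 0, 1, h),                        # half of central row
        region(img, 0, h, h, 1),                        # half of central column
    ]
    feats = []
    for g in groups:
        feats.extend(mean_sd(g))
    return feats

feats = rectilinear_features([[10] * 8 for _ in range(8)])
```

Whatever the input field size n, the result is always a fixed-length vector of twenty features, which is the property the text emphasises.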

Terminal set II-circular features
The second terminal set is based on a number of circular features, as shown in Figure 4. The features were computed based on a series of concentric circles centred in the input field. This terminal set focuses on boundaries rather than regions. The gap between the radii of two neighbouring circles is one pixel. For instance, if the input field is 19 × 19 pixels, then the number of central circles will be ⌊19/2⌋ + 1 = 10 (the central pixel is considered as a circle with a zero radius); accordingly, there would be 20 features. Unlike the fixed-size rectilinear terminal set, the number of circular features in this terminal set depends on the size of the input field.
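A sketch of this terminal set: mean and standard deviation over concentric one-pixel-thick rings centred in the input field. Discretising circles on a pixel grid is ambiguous; the sketch below uses square (Chebyshev-distance) rings so that the rings tile the field exactly, which is an assumption rather than the paper's definition.

```python
def circular_features(img):
    """Mean and SD per concentric ring (radius 0 = the central pixel)."""
    n = len(img)
    centre = (n - 1) / 2.0
    rings = {}
    for r in range(n):
        for c in range(n):
            # ring index by Chebyshev distance from the field centre
            d = int(round(max(abs(r - centre), abs(c - centre))))
            rings.setdefault(d, []).append(img[r][c])
    feats = []
    for d in sorted(rings):
        ring = rings[d]
        m = sum(ring) / len(ring)
        sd = (sum((p - m) ** 2 for p in ring) / len(ring)) ** 0.5
        feats.extend([m, sd])
    return feats

# A 19 x 19 field yields 10 rings and hence 20 features, as in the text.
feats = circular_features([[5] * 19 for _ in range(19)])
```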

Terminal set III-pixels
The goal of this terminal set is to investigate the use of raw pixels as terminals in GP. To decrease the computation cost, we considered a 2 × 2 square, or 4 pixels, as a single pixel. The average value of the 4 pixels in the square was used as the value of this pixel, as shown in Figure 5.
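The 2 × 2 averaging step can be sketched directly; it quarters the number of raw-pixel terminals the evolved programs must handle.

```python
def downsample_2x2(img):
    """Replace each 2 x 2 block of pixels with its average value."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] +
              img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]

small = downsample_2x2([[0, 4, 8, 12],
                        [0, 4, 8, 12],
                        [2, 2, 2, 2],
                        [2, 2, 2, 2]])
# small == [[2.0, 10.0], [2.0, 2.0]]
```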

The function sets
We used two different function sets in the experiments: 4 arithmetic operations only, and a combination of arithmetic and transcendental functions.

Function set I
In the first function set, the 4 standard arithmetic operations were used to form the nonterminal nodes:

FuncSet1 = {+, −, *, /}. (1)

The +, −, and * operators have their usual meanings (addition, subtraction, and multiplication), while / represents "protected" division, which is the usual division operator except that a divide by zero gives a result of zero. Each of these functions takes two arguments. This function set was designed to investigate whether the 4 standard arithmetic functions are sufficient for multiple-class object detection problems.

Figure 3: The input field and the image regions and lines for feature selection in constructing terminals. (The input field is n × n; the smaller squares are n/2 × n/2; the size of the lines is user defined, with default n/2.)
A generated program consisting of the 4 functions and a number of rectilinear terminals is shown in Figure 6. The LISP form of this program is shown in Figure 7.
This program performed particularly well for the coin images.
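The "protected" division described above can be sketched as a one-liner: ordinary division, except that a zero divisor yields zero instead of raising an error, so every evolved program is total over its inputs.

```python
def protected_div(a, b):
    """Usual division, except that dividing by zero returns zero."""
    return 0.0 if b == 0 else a / b

protected_div(10.0, 4.0)   # 2.5
protected_div(3.0, 0.0)    # 0.0
```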

Function set II
We also designed a second function set. We hypothesised that convergence might be quicker if the function values were close to the range (−1, 1), and that more functions might lead to better results if the 4 arithmetic functions were not sufficient. We introduced some transcendental functions, that is, the absolute-value function dabs, the trigonometric sine function sin, the logarithmic function log, and the exponential (base e) function exp, to form the second function set:

FuncSet2 = {+, −, *, /, dabs, sin, log, exp}. (2)
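Function set II in executable form. The paper does not say how log's undefined inputs or exp's overflow are handled; the protections below (log of the absolute value, a clamped exponent) are common GP conventions, not the authors' definitions.

```python
import math

FUNC_SET_2 = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '/': lambda a, b: 0.0 if b == 0 else a / b,        # protected division
    'dabs': abs,
    'sin': math.sin,
    # assumed protections (not specified in the paper):
    'log': lambda a: 0.0 if a == 0 else math.log(abs(a)),
    'exp': lambda a: math.exp(min(a, 700.0)),          # avoid float overflow
}
```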

Object classification strategy
The output of a genetic program in a standard GP system is a floating point number. Genetic programs can be used to perform one-class object detection tasks by exploiting the division between the negative and nonnegative values of a program's output. For example, negative numbers can correspond to the background and nonnegative numbers to the objects in the (single) class of interest. This is similar to binary classification problems in standard GP, where the division between negative and nonnegative numbers acts as a natural boundary between the two classes. Thus, genetic programs generated by the standard GP evolutionary process primarily have the ability to represent and process binary classification or one-class object detection tasks. However, for the multiple-class object detection problems described here, where more than two classes of objects of interest are involved, the standard GP classification strategy mentioned above cannot be applied.

Figure 6: A generated program for the coin detection problem.
In this approach, we develop a different strategy which uses a program classification map, as shown in Figure 8, for multiple-class object detection problems. Based on the output value of an evolved genetic program, this map identifies the class to which the object located in the current input field belongs. In this map, m refers to the number of object classes of interest, v is the output value of the evolved program, and T is a user-defined constant which plays the role of a threshold.
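One plausible realisation of such a map is sketched below, assuming class i corresponds to program outputs in the interval [(i−1)·T, i·T) and anything outside [0, m·T) is background. Figure 8 defines the authoritative intervals, so this particular assignment is an assumption for illustration only.

```python
def classify(v, m, T):
    """Map program output v to a class index 1..m, or None for background."""
    if 0 <= v < m * T:
        return int(v // T) + 1
    return None

classify(150, m=3, T=100)   # class 2
classify(-20, m=3, T=100)   # background (None)
```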

The fitness function
Since the goal of object detection is to achieve both a high DR and a low FAR, we should consider a multiobjective fitness function in our GP system for multiple-class object detection problems. In this approach, the fitness function is based on a combination of the DR and the FAR on the images in the training set during the learning process. Figure 9 shows the object detection procedure and how the fitness of an evolved genetic program is obtained.
The fitness of a genetic program is obtained as follows.
(1) Apply the program as a moving n × n window template (n is the size of the input field) to each of the training images and obtain the output value of the program at each possible window position. Label each window position with the "detected" object according to the object classification strategy described in Figure 8. Call this data structure a detection map. An object in a detection map is associated with a floating point program output.
(2) Find the centres of the objects of interest. This is done as follows: scan the detection map for an object of interest; when one is found, mark this point as the centre of the object and continue the scan n/2 pixels later in both the horizontal and vertical directions.
(3) Match these detected objects with the known locations and classes of the desired true objects. A match is considered to occur if the detected object is within tolerance pixels of its known true location. A tolerance of 2 means that an object whose true location is (40, 40) would be counted as correctly located at (42, 38) but not at (43, 38). The tolerance is a constant parameter defined by the user.
(4) Calculate the DR and the FAR of the evolved program.
(5) Compute the fitness of the program as

fitness = W_d · (1 − DR) + W_f · FAR,

where W_f and W_d are constant weights which reflect the relative importance of the FAR versus the DR. 2 With this design, the smaller the fitness, the better the performance. Zero fitness is the ideal case, which corresponds to the situation in which all of the objects of interest in each class are correctly found by the evolved program without any false alarms.
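The weighted combination in step (5) can be sketched as follows, assuming DR and FAR are expressed as percentages (as elsewhere in the paper) and normalising so that perfect detection (DR = 100%, FAR = 0%) gives fitness 0, as the text states; the exact normalisation is an assumption. The weight values are the ones reported later in the parameter section.

```python
def fitness(dr, far, w_d, w_f):
    """dr, far in percent; smaller fitness is better, 0 is ideal."""
    return w_d * (100.0 - dr) / 100.0 + w_f * far / 100.0

ideal = fitness(100.0, 0.0, w_d=1000, w_f=50)     # 0.0 -- the ideal case
typical = fitness(80.0, 250.0, w_d=1000, w_f=50)  # 325.0
```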

Main parameters
Once a GP system has been created, one must choose a set of parameters for a run. Based on the roles they play in the learning/evolutionary process, we group these parameters into three categories: search parameters, genetic parameters, and fitness parameters.

Search parameters
The search parameters used here include the number of individuals in the population (population-size), the maximum depth of the randomly generated programs in the initial population (initial-max-depth), the maximum depth permitted for programs resulting from crossover and mutation operations (max-depth), and the maximum number of generations the evolutionary process can run (max-generations). These parameters control the search space and when to stop the learning process. In theory, the larger these parameters, the greater the chance of success. In practice, however, it is impossible to set them very large due to hardware limitations and the high cost of computation.
There is one further search parameter, the size of the input field (input-size), which determines the size of the moving window over which a genetic program is evaluated in the program-sweeping procedure.

Genetic parameters
The genetic parameters determine how many genetic programs are used or produced by the different genetic operators in the mating pool when creating the next generation. These parameters include: the percentage of the best individuals in the current population that are copied unchanged to the next generation (reproduction-rate); the percentage of individuals in the next generation that are to be produced by crossover (cross-rate); the percentage of individuals in the next generation that are to be produced by mutation (mutation-rate = 100% − reproduction-rate − cross-rate); the probability that, in a crossover operation, two terminals will be swapped (cross-term); and the probability that, in a crossover operation, random subtrees will be swapped (cross-func = 100% − cross-term).
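Since mutation-rate and cross-func are derived from the other rates, the per-generation bookkeeping can be sketched as below; `genetic_counts` is a hypothetical helper, not part of the paper's system.

```python
def genetic_counts(population_size, reproduction_rate, cross_rate, cross_term):
    """Per-generation bookkeeping for the percentage-based genetic parameters.

    mutation-rate and cross-func are not free parameters: they are fixed
    by the constraint that each group of rates sums to 100%.
    """
    mutation_rate = 100 - reproduction_rate - cross_rate
    n_cross = population_size * cross_rate // 100
    by_terminal = round(n_cross * cross_term / 100)
    return {
        "reproduced": population_size * reproduction_rate // 100,
        "crossover": n_cross,
        "mutated": population_size * mutation_rate // 100,
        "cross_by_terminal_swap": by_terminal,
        "cross_by_subtree_swap": n_cross - by_terminal,
    }
```

With the easy-image settings given later (population 100, reproduction 10%, crossover 65%, cross-term 15%), this reproduces the counts quoted in the text: 10 reproduced, 65 by crossover (10 by terminal swap, 55 by subtree swap), and 25 by mutation.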

Fitness parameters
The fitness parameters include a threshold parameter (T) in the object classification algorithm, a tolerance parameter (tolerance) in object matching, and two constant weight parameters (W f and W d ) reflecting the relative importance of the DR and the FAR in obtaining the fitness of a genetic program.

Parameter values
Good selection of these parameters is crucial to success, and the values can differ considerably between object detection tasks. However, there does not seem to be a reliable way of deciding these values a priori. To obtain good results, the values were carefully chosen through empirical search. The values used are shown in Table 3.
For detecting circles and squares in the easy images, for example, we set the population size to 100. On each iteration, 10 programs are created by reproduction, 65 programs by crossover, and 25 by mutation. Of the 65 crossover programs, 10 (15%) are generated by swapping terminals and 55 (85%) by swapping subtrees. The programs are randomly initialised with a maximum depth of 4 at the beginning, and the depth can be increased to 8 during the evolutionary process. We also use 100, 50, 1000, and 2 for the constant parameters T, W_f, W_d, and tolerance, which are used for program classification and the calculation of the fitness function. The maximum number of generations permitted for the evolutionary process is 100 for this detection problem. The size of the input field is the same as that used in the NN approach [12], that is, 14 × 14.

Termination criteria
In this approach, the learning/evolutionary process is terminated when one of the following conditions is met: (1) an ideal program is found, that is, one whose fitness is zero because it detects all the objects of interest in the training set with no false alarms; or (2) the number of generations reaches max-generations.

THE IMAGE DATABASES
We used three different databases in the experiments. Example images and key characteristics are given in Figure 10. The databases were selected to provide detection problems of increasing difficulty. Database 1 (easy) was generated to give well-defined objects against a uniform background. The pixels of the objects were generated using a Gaussian generator with different means and variances for each class. There are three classes of small objects of interest in this database: black circles (class1), grey squares (class2), and white circles (class3). The Australian coin images (database 2) were intended to be somewhat harder and were taken with a CCD camera over a number of days with relatively similar illumination. In these images, the background varies slightly in different areas of the image and between images, and the objects to be detected are more complex, but still regular. There are 4 object classes of interest: the head side of 5-cent coins (class head005), the head side of 20-cent coins (class head020), the tail side of 5-cent coins (class tail005), and the tail side of 20-cent coins (class tail020). All the objects in each class have a similar size. They are located at arbitrary positions and with some rotations. The retina images (database 3) were taken by a professional photographer with special apparatus at a clinic and contain very irregular objects on a very cluttered background. The objective is to find two classes of retinal pathologies: haemorrhages and microaneurisms. To give a clear view of representative samples of the target objects in the retina images, one sample piece of these images is presented in Figure 11. In this figure, haemorrhage and microaneurism examples are labelled with surrounding white squares.

EXPERIMENTAL RESULTS
We performed three groups of experiments, as shown in Table 4. The first group of experiments is based on the first two terminal sets (rectilinear features and circular features) and the first function set (the 4 standard arithmetic functions). The second group of experiments uses the third terminal set, consisting of raw pixels, and the first function set. The third group of experiments uses the first terminal set, consisting of rectilinear features, and the second function set, consisting of additional transcendental functions. In these experiments, 4 out of 10 images in the easy image database are used for training and 6 for testing. For the coin images, 10 out of 20 are used for training and 10 for testing. For the retina images, 10 are used for training and 5 for testing. The total number of objects is 300 for the easy image database, 400 for the Australian coin images, and 328 for the retina images. The results presented in this section were achieved by applying the evolved genetic programs to the images in the test sets.

Experiment I
This group constitutes the major part of the investigation. The main goal here is to investigate whether this GP approach can be applied to multiple-class object detection problems of increasing difficulty. The parameters used in these experiments are shown in Table 3 (Section 3.6.4). The average performance of the best 10 genetic programs (evolved from 10 runs) for the easy and the coin databases, and the average performance of the best 5 genetic programs (out of 5 runs, due to the high computational cost) for the retina images are presented.
The results are compared with those obtained using an NN approach for object detection on the same databases [12,39]. The NN method used was the same as the GP method shown in Section 1.1, except that the evolutionary process was replaced by a network training process in step (3) and the generated genetic program was replaced by a trained network. In this group of experiments, the networks also used the same set of pixel statistics as TermSet1 (rectilinear) as inputs. Considerable effort was expended in determining the best network architectures and training parameters. The results presented here are the best results achieved by the NNs, and we believe that the comparison with the GP approach is a fair one. Table 5 shows the best results of the GP approach with the two different terminal sets (GP1 with TermSet1, GP2 with TermSet2) and the NN method for the easy images. For class1 (black circles) and class3 (white circles), all three methods achieved a 100% DR with no false alarms. For class2 (grey squares), the two GP methods also achieved a 100% DR with zero false alarms. However, the NN method had an FAR of 91.2% at a DR of 100%.

Coin images
Experiments with the coin images gave similar results to the easy images. These are shown in Table 6. Detecting the heads and tails of 5-cent coins (classes head005 and tail005) appears to be relatively straightforward: all three methods achieved a 100% DR without any false alarms. Detecting the heads and tails of 20-cent coins (classes head020 and tail020) is more difficult. While the NN method resulted in many false alarms, the two GP methods had much better results. In particular, the GP1 method achieved the ideal result: all the objects of interest were correctly detected without any false alarms for all 4 object classes.

Retina images
The results for the retina images are summarised in Table 7. Compared with the results for the other image databases, these results are not satisfactory. However, the FAR is greatly improved over the NN method.
The results over the three databases show similar patterns: the GP-based method always gave a lower FAR than the NN approach for the same detection rate. While GP2 also gave the ideal results for the easy images, it produced a higher FAR on both the coin and the retina images than the GP1 method. This suggests that the local rectilinear features are more effective for these detection problems than the circular features.

Training times
We performed these experiments on a 4-processor ULTRA-SPARC4. The training times for the three databases are very different because of the varying difficulty of the detection problems. The average training times for the GP evolutionary process (GP1) on the easy, the coin, and the retina images are 2 minutes, 36 hours, and 93 hours, respectively. This is much longer than the NN method, which took 2 minutes, 35 minutes, and 2 hours on average. However, the GP method gave much better detection results on all three databases. This suggests that the GP method is particularly applicable to tasks where accuracy is the most important factor and training time is seen as relatively unimportant.

Experiment II
Instead of using rectilinear and circular features (pixel statistics) as in experiment I, experiment II directly uses the pixel values as terminals (the third terminal set). For input field sizes of 14 × 14, 24 × 24, and 16 × 16 for the easy, the coin, and the retina images, the number of terminals is 49 (7 × 7), 144 (12 × 12), and 64 (8 × 8), respectively. For the easy images, the learning took about 70 hours on a 4-processor ULTRA-SPARC4 machine and 78 generations to reach perfect detection performance on the training set. The population size used was 1000, the maximum depth of the program was 30, the maximum initial depth 10, and the maximum number of generations 100. For the coin images and the retina images, the situation was worse. Since a large number of terminals were used, the maximum depth of the program trees was increased to 50 for the coin images and 60 for the retina images. The population size for both databases was 3000, with a maximum number of generations of 100. The evolutionary process took three weeks to complete 50 generations for the coin images and five weeks to complete 50 generations for the retina images. The best detection results were an overall FAR of 22% at a 100% DR for the coin images, and about 850% FAR at a DR of 100% for microaneurisms in the retina images. While these results are worse than those obtained by GP1 and GP2 using the rectilinear and circular features, they are still better than the NN approach. In our experience, a larger population (e.g., 10000 or 50000), a larger program size (e.g., 100), and a larger number of generations (e.g., 300) could improve the results. Although this cannot be investigated with our current hardware, it is a promising direction as more powerful hardware, for example, parallel or genetic hardware, is developed.
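The terminal counts quoted above (49 for a 14 × 14 field, 144 for 24 × 24, 64 for 16 × 16) correspond to one terminal per 2 × 2 block of the window. A sketch, assuming the window is subsampled at every second pixel (the exact sampling scheme is our assumption):

```python
def pixel_terminals(window, step=2):
    """Subsample an n-by-n window into a flat list of pixel terminals.

    With step=2, a 14x14 window yields 7x7 = 49 terminals, a 24x24 window
    yields 144, and a 16x16 window yields 64, matching the counts in the
    text. The every-second-pixel scheme itself is an assumption here.
    """
    n = len(window)
    return [window[r][c] for r in range(0, n, step) for c in range(0, n, step)]
```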

Experiment III
Instead of using the four standard arithmetic functions, this experiment focused on using the extended function set (FuncSet2), as shown in Section 3.3.2. The parameters shown in Table 3 (Section 3.6.4) were used in this experiment. The best detection results for the three databases are shown in Table 8.
As can be seen from Table 8, this function set also gave ideal results for the easy and the coin images and a better result for the retina images. The best DR for the micro class is 100%, with a corresponding FAR of 463%. The best DR for the haem class is still 73.91%, but the FAR is reduced to 1214%. In addition, convergence was slightly faster when training on the coin and retina images. This suggests that dabs, sin, log, and exp are particularly useful for more difficult problems.
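A sketch of how the extended function set might be represented. The paper does not specify how log (or division in the base set) behaves on invalid arguments, so the protected variants below follow common GP practice and are our assumption, not the paper's definition:

```python
import math

# Protected primitives: GP systems conventionally return a safe default
# instead of raising on invalid arguments, so every evolved tree can be
# evaluated at every window position. The exact conventions are assumptions.
def pdiv(a, b):
    return a / b if b != 0 else 1.0          # protected division

def plog(a):
    return math.log(abs(a)) if a != 0 else 0.0  # protected log

def pexp(a):
    return math.exp(min(a, 700.0))           # clamp to avoid overflow

# FuncSet2: the 4 standard arithmetic functions plus dabs, sin, log, exp.
FUNC_SET_2 = {
    "+": lambda a, b: a + b, "-": lambda a, b: a - b,
    "*": lambda a, b: a * b, "/": pdiv,
    "dabs": abs, "sin": math.sin, "log": plog, "exp": pexp,
}
```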

Analysis of results on the retina images
The GP-based approach achieved the ideal results on the easy images and the coin images, but resulted in some false alarms on the retina images, particularly for the detection of objects in class haem in which the FAR was very high and more than a quarter of the real objects in this class were not detected by the evolved genetic program.
We identified two possible reasons for the results on the retina images being worse than the results on the easy and the coin images. The first reason concerns the complexity of the background. In the easy and coin images, the background is relatively uniform, whereas in the retina images it is highly cluttered. In particular, the background of the retina images contains many objects, such as veins and other anatomical features, that are not members of the two classes of interest (microaneurisms and haemorrhages). These objects of noninterest must be classified as "background," in just the same way as the genuine background. The more complex the boundary between classes in the input space, the more complex an evolved program has to be to distinguish the classes. It may be that the more complex background class in the retina images requires a more complex evolved program than the GP system was able to discover. It may even be that the set of terminals and functions is not adequate/sufficient to represent an evolved program to distinguish the objects of interest from such a rich background.
The second possible reason concerns the variation in size of the objects. In the easy and coin images, all of the objects in a class have similar sizes, whereas in the retina images, the sizes of the objects in each class vary. This variation means that the evolved genetic program must cover a more complicated region of the input space. The sizes of the micro objects vary from 3 × 3 to 5 × 5 pixels and the sizes of the haem objects vary from 6 × 6 to 14 × 14 pixels. Given the size of the input field (16 × 16) and the choice of terminals, the variation in the size of the haem objects is particularly problematic, since they range from just one quarter of the input field (hence entirely inside the central detection region) to almost the entire input field. The fact that the performance on the haem class is worse than the performance on the micro class (especially in experiment III) provides additional evidence that the size variation is a cause of the poor performance.
The first reason suggests that the current approach is limited on images containing cluttered backgrounds. One possible modification to address this limitation is to evolve multiple programs rather than a single program, either having a separate program for each class of interest, or having several programs to exclude different parts of the background. Another possible modification is to extend the terminal set and/or function set to enrich the expressive power of the evolved programs.
The second reason suggests that the current approach has limited applicability to scale invariant detection problems. This would not be surprising, given the current set of terminals and functions. In particular, although the pixel statistics used in the rectilinear and circular terminal sets are robust to small variations in scale, they are not robust to large variations. We will explore alternative pixel statistics that are more robust to scale variations, and also function sets that would allow disjunctive programs that could better represent classes that contained objects of several different size ranges.

Analysis of evolved programs
This section gives a brief analysis of the best generated programs for the three databases. The genetic programs evolved by GP1 in experiment I are used as examples. Figure 12 shows three good sample evolved programs for the easy images. (These programs are the direct mathematical conversion of the original LISP-format programs evolved by the evolutionary process. The LISP format of the first program is, for example, shown in Figure 13. Note that we did not simplify them; simplification of evolved genetic programs is beyond the scope of this paper.) All of these programs achieved the ideal results: all of the circles and squares were correctly detected with no false alarms.

Easy images
There are several things we can note about these programs. Firstly, the programs are not trivial, and are decidedly nonlinear. It is hard to interpret these programs even for the easy images. Secondly, the programs use many, but not all, of the terminals, and do not use any constants. No group of terminals is entirely unused: both the means and the standard deviations of both the square regions and the lines appear in the programs, so it does not appear that any of the terminals could be safely removed. Thirdly, although the programs are not in their simplest form (e.g., the factor F5/F5 could be removed from the first program), there is not a large amount of redundancy, so the GP search is finding reasonably efficient programs.

Coin images
In addition to the program shown in Figure 6, we present another generated program in Figure 14, which also performed perfectly for the coin images.
Compared with those for the easy images, these programs are more complex, which reflects the greater difficulty of the detection problem in the coin images. One difference is that these programs also contain constants. Allowing constants as well as terminals considerably expands the set of possible programs, but the search for good values for the constants is difficult. Our current GP is biased so that constants are introduced only rarely, but it is clear that the detection problem on the coin images is sufficiently difficult to require some of these constants.

Retina images
One evolved genetic program for the retina images is presented in Figure 15. (The program is presented in LISP format rather than standard format because of its complexity.) This program is much more complex than any of the programs for the easy and the coin images. The program uses all 20 terminals and 8 constants. It does not seem possible to make any meaningful interpretation of this program. It may be that with high-level, domain-specific features and domain-specific functions, it would be possible for the GP system to construct simpler and more interpretable programs; however, this would be against one of the goals of this paper which is to investigate domain-independent approaches.
Even the best programs for the retina images gave quite a high number of false alarms, and it appears that the 20 terminals and 4 standard arithmetic functions are not sufficient for constructing programs for such difficult detection problems. Nonetheless, the program above still had much better performance than an NN with the same input features.

Analysis of classification strategy
As described in Figure 8, we used a program classification map as the classification strategy. In this map, a constant T is used to give fixed-size ranges for determining the classes of objects from the output of the program. The parameter can be regarded as a threshold or class-boundary parameter. Using a single value for T forces the classes to occupy equal-sized ranges of the program output, which might prolong evolution. A natural question is whether we can replace the single parameter T with a set of parameters, say, T1, T2, ..., Tm, one for each class of interest.
To answer this question, we ran a set of experiments on the easy images with three threshold parameters, T1, T2, and T3, in the program classification map. The experiments showed that some sets of values of the parameters resulted in ideal performance but other sets of values did not. Also, the learning/evolutionary process converged very fast with some sets of values but very slowly with others. However, the results of the experiments gave no guidelines for selecting a good set of values for these parameters. In some cases, using separate parameters for each threshold may lead to better performance than using a single parameter, but appropriate values for the parameters need to be determined empirically. In practice, this is difficult because, in most cases, there is no a priori knowledge for setting these parameters.
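The two variants of the classification map discussed above can be sketched as follows. Since Figure 8 is not reproduced in this excerpt, the half-open ranges [(i − 1)·T, i·T) below, with all other outputs mapped to background, are an assumed layout:

```python
def classify(output, num_classes, T):
    """Map a program's floating-point output to a class label.

    Assumed layout: class i (numbered from 1) occupies the half-open range
    [(i - 1) * T, i * T); any output outside [0, num_classes * T) is
    treated as background (returned as 0).
    """
    if 0 <= output < num_classes * T:
        return int(output // T) + 1
    return 0


def classify_multi(output, thresholds):
    """Variant with a separate range width T_i per class, as discussed above."""
    lower = 0.0
    for i, width in enumerate(thresholds, start=1):
        if lower <= output < lower + width:
            return i
        lower += width
    return 0
```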
We also tried an alternative classification strategy, which we call a multiple binary map, to classify multiple classes of objects. In this method, we convert a multiple-class classification problem into a set of binary classification problems. Given a problem L with m classes, L = {c1, c2, ..., cm}, the problem is decomposed into L1 = {c1, other}, L2 = {c2, other}, ..., Lm = {cm, other}, where ci denotes the ith class of interest and other refers to the class of nonobjects of interest. In this way, a multiple-class object detection problem is decomposed into a set of one-class object detection tasks, and GP is applied to each of the subsets to obtain the detection result for a particular class of interest. We tested this method on the detection problems in the three image databases, and the results were similar to those of the original experiments.
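The decomposition described above can be sketched as follows (the function name is ours):

```python
def binary_decompose(labels, classes):
    """Decompose a multiclass labelling into one binary task per class.

    Each sub-task L_i keeps class c_i and relabels everything else as
    "other", so a separate one-class detector can be evolved for it.
    """
    return {c: [lab if lab == c else "other" for lab in labels]
            for c in classes}
```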
One disadvantage of this method is that several genetic programs have to be evolved. On the other hand, the individual programs may be simpler, which may reduce the training time for each program. In fact, for the coin images, creating a set of one-class programs required a considerably shorter total training time than creating a single multiple-class program. A more detailed discussion of this method is beyond the scope of this paper and is left to future work.

Analysis of reproduction
In early GP, the reproduction rule performed a probabilistic selection of genetic programs from the current population based on their fitness and allowed them to survive by copying them into the new population: the better the fitness, the more likely the individual program is to be selected [24,42]. However, this mechanism does not guarantee that the best program will survive. An alternative reproduction rule removes the probabilistic element and simply reproduces the best n genetic programs from the current population. We ran experiments on the easy images with both reproduction rules and plotted the best fitness in each generation (see Figure 17). The dotted curve shows the best fitness with the probabilistic reproduction rule. Over the 100 generations, there are 4 clear points (at generations 7, 22, 45, and 67) where the fitness became worse rather than better, which delayed the convergence of learning. In contrast, the deterministic reproduction rule gave a steady improvement in fitness. Furthermore, the deterministic rule converged on an ideal program after just 71 generations, while the probabilistic rule had still not converged on an ideal program after 100 generations. (In fact, the fitness did not improve at all during the final 30 generations.) Clearly, the deterministic reproduction rule greatly improved the training speed and convergence.
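The two reproduction rules compared above can be sketched as follows. Fitness is minimised, as in this paper's fitness function; the inverse-fitness weighting in the probabilistic variant is one common choice, not necessarily the exact scheme of [24,42]:

```python
import random

def probabilistic_reproduction(population, fitnesses, n):
    """Fitness-proportionate copying (smaller fitness = better).

    Better programs are more likely to be copied, but survival of the best
    program is not guaranteed. The 1/(1+f) weighting is an assumed scheme.
    """
    weights = [1.0 / (1.0 + f) for f in fitnesses]
    return random.choices(population, weights=weights, k=n)


def deterministic_reproduction(population, fitnesses, n):
    """Copy the n best programs unchanged (elitism).

    The best program found so far is guaranteed to survive, so the best
    fitness per generation can never get worse.
    """
    ranked = sorted(zip(fitnesses, range(len(population))))
    return [population[i] for _, i in ranked[:n]]
```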

CONCLUSIONS
The goal of this paper was to develop a domain-independent, learning/adaptive approach for detecting small objects of multiple classes in large images based on GP. This goal was achieved by the use of GP with a set of domain-independent pixel statistics as terminals, a number of standard operators as functions, and a linear combination of the DR and FAR as the fitness measure. A secondary goal was to compare the performance of this method with an NN method. Here the GP approach outperformed the NN approach in terms of detection accuracy. The approach appears to be applicable to detection problems of varying difficulty as long as the objects are approximately the same size and the background is not too cluttered.
The paper differs from most work in object detection in two ways. Most work addresses the one-class problem, that is, object versus nonobject, or object versus background. This paper has shown a way of solving a multiple-class object detection problem without breaking it into a collection of one-class problems. Also, most current research uses different algorithms in multiple independent stages to solve the localisation problem and the classification problem; in contrast, this paper uses a single learned genetic program for both object classification and object localisation. The experiments showed that mutation does play an important role in the three multiple-class object detection tasks. This is in contrast to Koza's early claim that GP does not need mutation. For GP applied to multiple-class object detection problems, the experiments suggest that a 15%-30% mutation rate would be a good choice.
The experiments also identified some limitations of the particular approach taken in the paper. The first limitation concerns the choice of input features and the function set. For the simple and medium-difficulty object detection problems, the 20 regional/rectilinear features and 4 standard arithmetic functions performed very well; however, they were not adequate for the most difficult object detection task. In particular, they were not adequate for detecting classes of objects with a range of sizes. Further work will be required to discover more effective domain-independent features and function sets, especially ones that provide some size invariance.
A second limitation is the high training time required. One aspect of this training time is the experimentation required to find good values of the various parameters for each different problem. The GP method appears to be applicable to multiple-class object detection tasks where accuracy is the most important factor and training time is seen as relatively unimportant, as is the case in most industrial applications. Further experimentation may reveal more effective ways of determining parameters which will reduce the training times.
Subject to these limitations, the paper has demonstrated that GP can be used effectively for the multiple-class detection problem and provides more evidence that GP has a great potential for application to a variety of difficult problems in the real world.