Unsupervised Performance Evaluation of Image Segmentation

We present in this paper a study of unsupervised evaluation criteria that enable the quantification of the quality of an image segmentation result. These evaluation criteria compute statistics for each region or class in a segmentation result. Such an evaluation criterion can be useful for different applications: the comparison of segmentation results, the automatic choice of the best-fitted parameters of a segmentation method for a given image, or the definition of new segmentation methods by optimization. We first present the state of the art of unsupervised evaluation, and then compare six unsupervised evaluation criteria. For this comparative study, we use a database composed of 8400 synthetic gray-level images, each segmented in four different ways. Vinet's measure (correct classification rate) is used as an objective reference to compare the behaviors of the different criteria. Finally, we present experimental results on the segmentation evaluation of a few gray-level natural images.


INTRODUCTION
Segmentation is an important stage in image processing since the quality of any subsequent image interpretation depends on it. Several approaches have been put forward in the literature [1,2]. The region approach to image segmentation consists in determining regions of neighboring pixels that have similar properties (gray level, texture, ...). The contour approach detects the boundaries of these regions. We have decided to focus on the first approach, namely region-based image segmentation, because the corresponding segmentation methods give better results in the textured case (the most difficult one). Classification methods can be used afterwards; in this case, a class can be composed of different regions of the segmentation result.
However, it is difficult to evaluate the efficiency of different segmentation methods and to compare them objectively. This more general problem of evaluating a segmentation result has been addressed in the literature [3]. There are two main approaches.
On the one hand, there are supervised evaluation criteria based on the computation of a dissimilarity measure between a segmentation result and a ground truth. These criteria are widely used in medical applications [4]. Baddeley's distance [5], Vinet's measure [6] (correct classification rate), or Hausdorff's measure [7] are examples of supervised evaluation criteria. To compare these criteria, it is possible to use synthetic images whose ground truth is directly available. An alternative solution is to use segmentation results manually made by experts on natural images. This strategy is more realistic if we consider the type of images, but the question of the different experts' objectivity then arises. This problem can be solved by merging the segmentation results obtained by the different experts [8] and by taking their subjectivity into account.
On the other hand, there are unsupervised evaluation criteria that enable the quantification of the quality of a segmentation result without any a priori knowledge. These criteria generally compute statistical measures, such as the gray-level standard deviation or the disparity of each region or class in the segmentation result. Currently, no evaluation criterion appears to be satisfactory in all cases. In this paper, we present and test different unsupervised evaluation criteria. They allow us to compare various segmentation results, to make the choice of the segmentation parameters easier, or to define new segmentation methods by optimizing an evaluation criterion. A segmentation result is defined at a level of precision. When using a classification method, we believe that the best way to define the level of precision of a segmentation result is its number of classes. We use the unsupervised evaluation criteria to compare segmentation results of an image that have the same precision level.
In Section 2, we present the state of the art of unsupervised evaluation criteria and highlight the most relevant ones. In Section 3, we compare the chosen criteria in order to evaluate their respective advantages and drawbacks. The comparison of these unsupervised criteria is first carried out in a supervised framework on synthetic images. In this case, the ground truth is obviously well known and the best evaluation criterion will be the one that maximizes the similarity of comparison with Vinet's measure. We then illustrate the ability of these evaluation criteria to compare various segmentation results (with the same level of precision) of real images in Section 4. We conclude and give the perspectives of this study in Section 5.

UNSUPERVISED EVALUATION
Without any a priori knowledge, most evaluation criteria compute statistics on each region or class in the segmentation result. The majority of these quality measurements are established in agreement with human perception. There are two main approaches in image segmentation: region segmentation and boundary detection. As we chose to focus on region-based image segmentation methods, which give better results in textured cases, the corresponding evaluation criteria are detailed in the next section.

Evaluation of region segmentation
One of the most intuitive criteria able to quantify the quality of a segmentation result is intraregion uniformity. Weszka and Rosenfeld [9] proposed such a criterion, which measures the effect of noise to evaluate thresholded images. Based on the same idea of intraregion uniformity, Levine and Nazif [10] also defined a criterion that computes the uniformity of a region characteristic from the variance of that characteristic:

    LEV_1(I_R) = 1 - (1 / Card(I)) \sum_{k=1}^{N_R} \sum_{s \in R_k} ( g_I(s) - \bar{g}_I(R_k) )^2,

where (i) I_R corresponds to the segmentation result of the image I into a set of regions R = {R_1, ..., R_{N_R}} having N_R regions, (ii) Card(I) corresponds to the number of pixels of the image I, (iii) g_I(s) corresponds to the gray-level intensity of the pixel s of the image I and can be generalized to any other characteristic (color, texture, ...), and \bar{g}_I(R_k) denotes the mean of g_I over the region R_k.
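As a concrete illustration, this variance-based uniformity can be sketched as follows; the normalization by half the squared gray-level range is our own assumption, introduced only so that the score stays in [0, 1], and is not necessarily the paper's exact normalization:

```python
def intra_uniformity(image, labels):
    """Variance-based intraregion uniformity in the spirit of Levine-Nazif.
    image, labels: 2-D lists of the same shape (gray levels / region ids).
    Returns a value in [0, 1]; 1 means every region is perfectly flat.
    Normalizing by half the squared gray-level range is an assumption."""
    pixels = {}
    flat = [g for row in image for g in row]
    for row_g, row_l in zip(image, labels):
        for g, r in zip(row_g, row_l):
            pixels.setdefault(r, []).append(g)
    norm = max((max(flat) - min(flat)) ** 2 / 2.0, 1e-12)
    total = 0.0
    for vals in pixels.values():
        mean = sum(vals) / len(vals)
        total += sum((g - mean) ** 2 for g in vals)  # Card(R_k) * sigma_k^2
    return 1.0 - total / (len(flat) * norm)
```

On a two-region image where each region is perfectly flat, the measure returns 1; merging the two regions into one halves the score under this normalization.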
A standardized uniformity measure was proposed by Sezgin and Sankur [11]. Based on the same principle, Cochran's homogeneity measurement [12] gives a confidence measure on the homogeneity of a region. However, this method requires a threshold that is often chosen arbitrarily, which limits the approach. Another criterion to measure intraregion uniformity was developed by Pal and Pal [13]. It is based on a thresholding that maximizes the local second-order entropy of the regions in the segmentation result. In the case of slightly textured images, these intraregion uniformity criteria prove to be effective and very simple to use. However, the presence of textures in an image often leads to improper results because small regions are given too much influence.
Complementary to the intraregion uniformity, Levine and Nazif [10] defined a disparity measurement between two regions to evaluate the dissimilarity of regions in a segmentation result. The total interregions disparity is defined as follows:

    LEV_2(I_R) = \sum_{k=1}^{N_R} w_{R_k} D(R_k) / \sum_{k=1}^{N_R} w_{R_k},
    with  D(R_k) = \sum_{R_j neighbor of R_k} ( p_{R_k \cap R_j} / p_{R_k} ) * | \bar{g}_I(R_k) - \bar{g}_I(R_j) | / ( \bar{g}_I(R_k) + \bar{g}_I(R_j) ),

where w_{R_k} is a weight associated with R_k that can depend on its area, for example; \bar{g}_I(R_k) is the mean gray level of R_k and can be generalized to a feature vector computed on the pixel values of the region R_k, as for LEV_1; and p_{R_k \cap R_j} corresponds to the length of the perimeter of the region R_k common to the perimeter of the region R_j. This type of criterion has the advantage of penalizing oversegmentation. Note that the intraregion uniformity can be combined with the interregions dissimilarity by averaging LEV_1 with the mean disparity computed over all C^2_{N_R} pairs of regions, where C^2_{N_R} is the number of combinations of 2 regions among N_R.
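A sketch of this perimeter-weighted contrast follows, assuming a precomputed region adjacency structure; the dictionary layout (area, mean gray level, shared-perimeter map) is ours, not the paper's, and the weight w_Rk is taken as the region area:

```python
def inter_region_contrast(regions):
    """Levine-Nazif style interregions contrast (sketch).
    regions: dict mapping region id -> (area, mean_gray, neighbors),
    where neighbors maps a neighboring region id to the length of the
    shared perimeter.  Each region's disparity is a perimeter-weighted,
    normalized gray-level contrast with its neighbors; regions are then
    weighted by their area (w_Rk = Card(R_k))."""
    weighted_sum = 0.0
    weight_total = 0.0
    for rid, (area, mean, neighbors) in regions.items():
        perimeter = sum(neighbors.values())
        if perimeter == 0:          # isolated region: no contrast term
            continue
        disparity = 0.0
        for nb, shared in neighbors.items():
            nb_mean = regions[nb][1]
            if mean + nb_mean > 0:  # guard the normalized contrast
                disparity += (shared / perimeter) * abs(mean - nb_mean) / (mean + nb_mean)
        weighted_sum += area * disparity
        weight_total += area
    return weighted_sum / weight_total
```

For two mutually adjacent regions of mean gray levels 100 and 50, each pairwise contrast is 50/150, so the global value is 1/3.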
The criterion of [14] combines intra- and interregions disparities. The intraregion disparity is computed as the normalized standard deviation of the gray levels in each region. The interregions disparity computes the dissimilarity of the average gray levels of two regions in the segmentation result.
Haralick and Shapiro consider that (i) the regions must be uniform and homogeneous, (ii) the interior of the regions must be simple, without too many small holes, (iii) adjacent regions must present significantly different values for the uniform characteristics, and (iv) boundaries must be smooth and accurate.
In Liu and Yang's evaluation criterion [16], the presence of numerous regions in a segmentation result is penalized only by the term N_R. In the case of very noisy images, the excess in the number of regions should be penalized more strongly. Indeed, the error generated by each small region is close to 0; consequently, the global criterion is also close to 0, which erroneously indicates that the segmentation result is very good. Borsotti et al. [15] identified this limitation and modified the criterion so as to penalize more strictly the segmentation results presenting many small regions as well as heterogeneous ones. These modifications make the criterion more sensitive to small variations of the segmentation result:

    BOR(I_R) = ( \sqrt{N_R} / (10000 * Card(I)) ) \sum_{k=1}^{N_R} [ E_k^2 / (1 + \log_{10} Card(R_k)) + ( \chi(Card(R_k)) / Card(R_k) )^2 ],

where \chi(Card(R_k)) corresponds to the number of regions having the same area Card(R_k), and E_k is defined as the sum of the Euclidean distances between the RGB color vectors of the pixels of R_k and the color vector attributed to the region R_k in the segmentation result.

Zeboudj [17] proposed a measure based on the combined principles of maximal interregions disparity and minimal intraregion disparity measured on a pixel neighborhood. One defines c(s, t) = |g_I(s) - g_I(t)| / (L - 1) as the disparity between two pixels s and t, with L being the maximal gray level. The interior disparity CI(R_i) of the region R_i is defined as follows:

    CI(R_i) = (1 / Card(R_i)) \sum_{s \in R_i} \max { c(s, t), t \in W(s) \cap R_i },

where Card(R_i) corresponds to the area of the region R_i and W(s) to the neighborhood of the pixel s. The external disparity CE(R_i) of the region R_i is defined as follows:

    CE(R_i) = (1 / p_i) \sum_{s \in F_i} \max { c(s, t), t \in W(s), t \notin R_i },

where p_i is the length of the boundary F_i of the region R_i. Lastly, the disparity of the region R_i is defined by the measurement C(R_i) \in [0, 1] expressed as follows: C(R_i) = 1 - CI(R_i)/CE(R_i) if 0 < CI(R_i) < CE(R_i); C(R_i) = CE(R_i) if CI(R_i) = 0; and C(R_i) = 0 otherwise. Zeboudj's criterion is defined by

    ZEB(I_R) = (1 / Card(I)) \sum_{i=1}^{N_R} Card(R_i) * C(R_i).

This criterion has the disadvantage of not correctly taking strongly textured regions into account.
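A gray-level sketch of Borsotti's criterion follows; the original formulation is defined on RGB color vectors, so taking the squared error of each region around its mean gray level as E_k^2 is our simplification for illustration:

```python
import math

def borsotti_q(image, labels):
    """Borsotti's Q criterion (lower is better), gray-level variant.
    image, labels: 2-D lists of the same shape.
    E_k^2 is taken here as the squared error of region k around its
    mean gray level (the original uses RGB color distances)."""
    regions = {}
    n_pix = 0
    for row_g, row_l in zip(image, labels):
        for g, r in zip(row_g, row_l):
            regions.setdefault(r, []).append(g)
            n_pix += 1
    areas = [len(v) for v in regions.values()]
    chi = {a: areas.count(a) for a in set(areas)}  # chi(a): #regions of area a
    total = 0.0
    for vals in regions.values():
        area = len(vals)
        mean = sum(vals) / area
        e2 = sum((g - mean) ** 2 for g in vals)    # E_k^2
        total += e2 / (1.0 + math.log10(area)) + (chi[area] / area) ** 2
    return math.sqrt(len(regions)) * total / (10000.0 * n_pix)
```

The (chi/area)^2 term is what penalizes many same-sized small regions, independently of their gray-level error.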
Considering the types of regions (textured or uniform) in the segmentation result, Rosenberger presented in [14, 18] a criterion that estimates the intraregion homogeneity and the interregions disparity. This criterion quantifies the quality of a segmentation result as follows:

    ROS(I_R) = ( \bar{D}(I_R) + (1 - D(I_R)) ) / 2,

where \bar{D}(I_R) corresponds to the total interregions disparity, which quantifies the disparity of each region of the image I with its neighbors. The total intraregion disparity, denoted by D(I_R), computes the homogeneity of each region of the image I:

    D(I_R) = (1 / N_R) \sum_{i=1}^{N_R} ( Card(R_i) / Card(I) ) D(R_i),

where D(R_i) is the intraregion disparity of the region R_i. \bar{D}(I_R) has a similar definition.

Intraregion disparity
The intraregion disparity D(R_i) is computed according to the textured or uniform type of the region R_i. This determination is made from statistical computations on the cooccurrence matrix of the gray-level intensities of the pixels in the region R_i. More details about this computation can be found in [18].
In the uniform case, the intraregion disparity is equal to the normalized standard deviation of the region. This second-order statistic on the dispersion of the gray levels in a region is sufficient to characterize the intraclass disparity of a uniform region.
If the region is textured, the standard deviation does not give reliable information on its homogeneity. A more complex process based upon texture attributes and clustering evaluation is used instead. A procedure detailed in [18] is followed to compute the homogeneity of each textured region in the segmentation result.
Briefly stated, a region containing two different texture primitives must have a higher intraregion disparity than the same region composed of a single primitive. A dispersion measure of the Haralick and Shapiro texture attributes computed in each region is therefore used.

Interregions disparity
The total interregions disparity \bar{D}(I_R), which measures the disparity of each region depending on its type (uniform or textured), is defined as follows:

    \bar{D}(I_R) = (1 / N_R) \sum_{i=1}^{N_R} ( Card(R_i) / Card(I) ) \bar{D}(R_i),

where \bar{D}(R_i) is the interregions disparity of the region R_i. The interclass disparity computes the average dissimilarity of a region with its neighbors. The interregions disparity of two neighboring regions is also computed by taking their types into account.
(A) Regions of the same type

(i) Uniform regions. This parameter is computed as the average of the disparity of a region with its neighbors. The disparity of two uniform regions R_i and R_j is calculated as

    \bar{D}(R_i, R_j) = | \bar{g}_I(R_i) - \bar{g}_I(R_j) | / NG,

where \bar{g}_I(R_i) is the average gray level in the region R_i and NG is the number of gray levels.

(ii) Textured regions. The disparity of two textured regions R_i and R_j is defined as

    \bar{D}(R_i, R_j) = || G_i - G_j ||,

where G_i is the average parameter vector describing the region R_i (it corresponds to \bar{g}_I(R_i) in the uniform case and to the average values of the Haralick and Shapiro texture attributes otherwise), and || . || corresponds to the quadratic norm. We could have used a more complex distance such as the Bhattacharyya distance, but we do not want to make any hypothesis on the probability density functions.

(B) Regions of different types
The disparity of regions of different types is set as the maximal value 1.
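The three cases above (uniform pair, textured pair, mixed types) can be summarized in a small dispatcher; the descriptor conventions (scalar means for uniform regions, attribute vectors for textured ones) and the default of 256 gray levels are assumptions made for illustration:

```python
def pair_disparity(desc_i, desc_j, type_i, type_j, n_gray=256):
    """Disparity of two neighboring regions, following the three cases:
    - different types: maximal disparity 1;
    - both uniform: normalized difference of mean gray levels;
    - both textured: quadratic norm between attribute vectors.
    Descriptor layout and n_gray=256 are assumptions for this sketch."""
    if type_i != type_j:
        return 1.0
    if type_i == "uniform":
        return abs(desc_i - desc_j) / n_gray
    # textured: desc_i, desc_j are average texture-attribute vectors
    return sum((a - b) ** 2 for a, b in zip(desc_i, desc_j)) ** 0.5
```

For example, two uniform regions of mean gray levels 100 and 50 get disparity 50/256, while any uniform/textured pair gets the maximal value 1.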
Some studies showed the efficiency of this criterion even for segmentation results of textured images [19].

COMPARATIVE STUDY
In this section, we compare different evaluation criteria devoted to region-based segmentation methods, pointing out their respective aspects of interest and limitations. The goal is then to identify the domain of applicability of each criterion.

Experimental protocol
We present here the image database, the segmentation methods, and the evaluation criteria we have used for the different tests.

Image database
We created a database (BCU) composed of synthetic images to compare the criteria values with a supervised criterion (for synthetic images, the ground truth is of course available). It includes 8400 images with 2 to 15 regions (see Figure 1). These images are classified into six groups for each number of regions (see Figure 2): (i) 100 images composed of 100% textured regions (B0U), (ii) 100 images composed of 75% textured regions and 25% uniform regions (B25U), (iii) 100 images composed of 50% textured regions and 50% uniform regions (B50U), (iv) 100 images composed of 25% textured regions and 75% uniform regions (B75U), (v) 100 images composed of 100% uniform regions (B100U), (vi) 100 images composed of 100% textured regions with the same mean gray level for each region (B0UN).
The textures used to create this image database were randomly extracted from the University of Oulu texture database (http://www.outex.oulu.fi).

Segmentation results
The segmentation methods we used are classification-based. Each image of the database is segmented by the fuzzy K-means method [20] with a number of classes corresponding to the number of regions of its ground truth. The second segmentation method is a relaxation [13] of this segmentation result, which improves the quality of the result in almost all cases. As third segmentation method, we used EDISON [21], which uses the "mean shift" algorithm developed by Georgescu and his colleagues (http://www.caip.rutgers.edu/riul/research/code/EDISON/). In order to keep a similar level of precision (number of classes) between all the segmentation results, we classified this segmentation result using the LBG algorithm [22]. The fourth segmentation result we consider is simply the best one available: the ground truth. Figure 3 presents an image with 8 regions from the database and the four corresponding segmentation results. As we can see in this figure, these segmentation results have different qualities.
The intrinsic quality of the segmentation results used for the comparison of evaluation criteria is not critical. Indeed, we are looking for an unsupervised evaluation criterion that behaves similarly to a supervised one used as reference (Vinet's measure). A similar methodology concerning performance measures for video object segmentation can be found in [23].
A good segmentation result maximizes the value of a criterion, except for Borsotti's, which has to be minimized. In order to facilitate the understanding of the proposed analysis, we used 1 - BOR(I_R) instead of BOR(I_R) for each segmentation result I_R.
Vinet's measure [6], a supervised criterion that corresponds to the correct classification rate, is used as reference for the analysis of the synthetic images, for which the ground truth is available. This criterion is often used in the literature to compare a segmentation result I_R with a ground truth I_{R_ref}. We compute the following superposition table:

    T(i, j) = card{ R_i \cap R^{ref}_j },

where card{R_i \cap R^{ref}_j} is the number of pixels belonging to the region R_i in the segmentation result I_R and to the region R^{ref}_j in the ground truth.
With this table, we recursively search for the matched classes, as illustrated in Figure 4, according to the following method: (1) we first select in the table the pair of classes (R_{i_k}, R^{ref}_{j_k}) that maximizes the superposition card{R_i \cap R^{ref}_j}; (2) the corresponding row and column are removed from the table, and the process is repeated until no class can be matched. The Vinet measure is then computed as follows:

    VIN(I_R) = (1 / Card(I)) \sum_k card{ R_{i_k} \cap R^{ref}_{j_k} }.

This criterion is often used to compute the correct classification rate of the segmentation result of a synthetic image.
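The superposition table and the class matching can be sketched as follows; the greedy pairing by decreasing overlap is a simple stand-in for the recursive selection described above:

```python
def vinet_measure(labels, ground_truth):
    """Vinet's measure (correct classification rate).
    labels, ground_truth: 2-D lists of class ids of the same shape.
    Builds the superposition table, greedily pairs the classes with the
    largest overlaps (a stand-in for the recursive selection), and
    returns the matched pixel fraction."""
    table = {}
    n_pix = 0
    for row_s, row_r in zip(labels, ground_truth):
        for s, r in zip(row_s, row_r):
            table[(s, r)] = table.get((s, r), 0) + 1
            n_pix += 1
    matched, used_s, used_r = 0, set(), set()
    for (s, r), count in sorted(table.items(), key=lambda kv: -kv[1]):
        if s not in used_s and r not in used_r:
            matched += count
            used_s.add(s)
            used_r.add(r)
    return matched / n_pix
```

An identical labeling scores 1.0; mislabeling one pixel of a four-pixel image scores 0.75.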

Experimental results
In this section, we analyze the previously presented unsupervised evaluation criteria. Their quality is evaluated by measuring, on the segmentation results, the similarity of their comparisons with those of the Vinet measure.

Comparative study
We look here for the evaluation criteria having the most similar behavior to the Vinet measure. To achieve this goal, we consider the comparison results of the different segmentation results for all the evaluation criteria. As we have four segmentation results for each image, there are 6 possible comparisons. These 6 possible comparisons of four segmentation results A, B, C, and D are A > B, A > C, A > D, B > C, B > D, and C > D. A comparison result is a value in {0, 1}: if a segmentation result has a higher value than another one for the considered evaluation criterion, the comparison value is set to 1; otherwise it is set to 0. In order to define the similarity between each evaluation criterion and the Vinet measure, an absolute difference is measured between the criterion comparisons and the Vinet ones. We define the cumulative similarity of correct comparison (SCC) as follows:

    SCC = \sum_{k=1}^{8400} \sum_{i=1}^{6} ( 1 - | A(i, k) - B(i, k) | ),

where A(i, k) is the ith comparison result using the Vinet measure and B(i, k) the one using an evaluation criterion for the image k (1 <= k <= 8400).
In order to quantify the efficiency of the evaluation criteria, we define the similarity rate of correct comparison (SRCC), which represents the similarity of comparison with the Vinet measure relative to the maximal value:

    SRCC = SCC / SCC_max,

where SCC_max = 6 x 8400 = 50 400 comparison results. Table 1 shows the SRCC value of all the criteria with respect to VIN. We can note that ZEB and LEV 2 have the highest SRCC values in the case of uniform images. In the textured case, LEV 2 ranks first, followed by ROS 2, except for the B0UN group: when textured regions have the same mean gray level, ROS 2 provides better results.
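A minimal sketch of this SCC/SRCC bookkeeping, assuming each criterion provides one score per segmentation result of each image (the dictionary layout is ours, for illustration):

```python
from itertools import combinations

def srcc(reference_scores, criterion_scores):
    """Similarity rate of correct comparison (SRCC) against a reference.
    Both arguments map an image id to the list of scores of its
    segmentation results (4 results -> 6 comparisons per image).
    Returns agreements / total comparisons, i.e. SCC / SCC_max."""
    agree = 0
    total = 0
    for key, ref in reference_scores.items():
        crit = criterion_scores[key]
        for i, j in combinations(range(len(ref)), 2):
            total += 1
            # the pairwise ordering is the {0,1} comparison result
            if (ref[i] > ref[j]) == (crit[i] > crit[j]):
                agree += 1
    return agree / total
```

A criterion that ranks the four results exactly like the reference gets SRCC = 1; one that reverses every ordering gets 0.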
The criteria that obtain the best SRCC values in almost all cases are LEV 2, ZEB, and ROS 2. These three criteria are complementary with respect to the type of the original images: the more an image contains textured (resp., uniform) regions, the more efficient LEV 2 or ROS 2 (resp., ZEB) is.
We illustrate thereafter the behaviors of the different criteria on various types of images.

Evaluation of segmentation results
We illustrate in this part the behavior of these evaluation criteria for different types of images. The Vinet measure (correct classification rate), considered as the reference, allows us to identify the best segmentation result.
Case of a uniform image. Figure 5 presents an original image with only uniform regions and its four segmentation results. In this case, VIN chooses the ground truth as the best, followed by the EDISON result. As shown in Table 2, only ZEB is able to sort these segmentation results like VIN.
Case of a mixed image. Figure 6 presents an original image with uniform and textured regions from B50U and its four segmentation results. According to Table 3, LEV 2 and ROS 2 sort the segmentation results correctly, except for one comparison.
Case of a textured image. Figure 7 presents an original image with only textured regions from B0U and its four segmentation results. In this case, ROS 2 is the only criterion that sorts the segmentation results correctly, except for one comparison (see Table 4).

Case of a textured image with regions having the same mean gray level. Figure 8 presents an original image with only textured regions with the same mean gray level from B0UN and its four segmentation results. According to Table 5, only ROS 2 sorts the segmentation results correctly. We can notice that LEV 2 gives bad results in this case.
As a conclusion of this comparative study, ZEB has to be preferred for uniform images while LEV 2 and ROS 2 are more adapted for mixed and textured ones.

APPLICATION TO REAL IMAGES
We illustrate here the ability of the previous evaluation criteria to compare different segmentation results of a single image at the same level of precision (here, the number of classes). The images chosen as illustration in this paper are an aerial and a radar image (see Figure 9). They were segmented by three different methods: FCM [25], PCM [20], and EDISON [21]. The first image is an aerial image composed of uniform and textured regions (Figure 10). The majority of the evaluation criteria rank the EDISON segmentation result as the best one (Table 6); in our opinion, this is also the case visually. The second image corresponds to a strongly noisy radar image (see Figure 11). The regions can thus all be regarded as textured. Visually, the best segmentation result of this image is, from our point of view, the EDISON one. Table 7 presents it as the best in almost all cases. ROS 2 gives this segmentation result a much better quality score than the FCM and PCM ones. On the contrary, ZEB ranks the EDISON segmentation result very badly. Moreover, ZEB keeps very weak values (about 0.1, whereas for the segmentation results of the other images, the best results exceeded 0.7). This confirms that ZEB is not adapted to strongly textured images.
In order to validate these results on real images, one could make a psychovisual study involving a significant number of experts [8,23].

CONCLUSION
Segmentation evaluation is essential to quantify the performance of existing segmentation methods. In this paper, the majority of the existing unsupervised criteria for the evaluation and the comparison of segmentation methods are reviewed and presented. The present study shows the strong points, the weak points, and the limitations of some of these criteria. For the comparative study, we used a large database composed of 8400 synthetic images containing from 2 to 15 regions. We thus have 33 600 segmentation results and consequently 50 400 comparisons of segmentation results. We noted that three criteria give better results than the others: ZEB, LEV 2, and ROS 2. ZEB is adapted to uniform images, while LEV 2 and ROS 2 find their applicability on textured images. We illustrated the importance of these evaluation criteria for the evaluation of segmentation results of real images without any a priori knowledge. The selected criteria were able, in our examples, to choose the segmentation result that was visually perceived as being the best.
A prospect for this work is to combine the best criteria in order to optimize their use in various contexts. Further perspectives concern the application of these evaluation criteria to the choice of segmentation method parameters, or the definition of new segmentation methods by optimizing an evaluation criterion.