Performance Evaluation in Image Processing

The scanning and computerized processing of images had its birth in 1956 at the National Bureau of Standards (NBS, now the National Institute of Standards and Technology (NIST)) [1]. Image enhancement algorithms were some of the first to be developed [2]. Half a century later, literally thousands of image processing algorithms have been published. Some of these have been specific to certain applications, such as the enhancement of latent fingerprints, whilst others have been more generic in nature, applicable to all, yet master of none.
The scope of these algorithms is fairly expansive, ranging from automatically extracting and delineating regions of interest, as in segmentation, to improving the perceived quality of an image, as in image enhancement. Since the early years of image processing, as in many subfields of software design, a portion of the design process has been dedicated to algorithm testing. Testing is the process of determining whether or not a particular algorithm has satisfied its specifications relating to criteria such as accuracy and robustness. A major limitation in the design of image processing algorithms lies in the difficulty of demonstrating that algorithms work to an acceptable measure of performance. The purpose of algorithm testing is two-fold. Firstly, it provides either a qualitative or a quantitative method of evaluating an algorithm. Secondly, it provides a comparative measure of the algorithm against similar algorithms, assuming similar criteria are used. One of the greatest challenges in designing algorithms incorporating image processing is deciding upon the criteria used to analyze the results. Do we design a criterion which measures sensitivity, robustness, or accuracy? Performance evaluation in the broadest sense refers to a measure of some required behavior of an algorithm, whether that is achievable accuracy, robustness, or adaptability. It allows the intrinsic characteristics of an algorithm to be emphasized, and its benefits and limitations to be evaluated.
More often than not though, such testing has been limited in its scope. Part of this is attributable to the lack of a formal process for the performance evaluation of image processing algorithms, from the establishment of testing regimes to the design of metrics. Selection of an appropriate evaluation methodology is dependent on the objective of the task. For example, in the context of image enhancement, requirements are essentially different for screen-based enhancement and enhancement which is embedded as a subalgorithm within a larger system. Screen-based enhancement is usually assessed in a subjective manner, whereas when an algorithm is encapsulated within a larger system, subjective evaluation is not available, and the algorithm itself must determine the quality of a processed image.
Very few approaches to the evaluation of image processing algorithms can be found in the literature, although the concept has been around for decades. A significant difficulty which arises in the evaluation of algorithms is finding suitable metrics which provide an objective measure of performance. A performance metric is a meaningful and computable measure used for quantitatively evaluating the performance of an algorithm. Consider the process of assessing image quality: there is no single quantitative metric which correlates well with image quality as perceived by the human visual system.
The process of analyzing failure is intrinsically coupled with the process of performance evaluation: in order to ascertain whether an algorithm fails or not, one has to define the characteristics of success. Failure analysis is the process of determining why an algorithm fails during testing. The knowledge generated is then fed back to the design process in order to engender refinements in the algorithm. This is a difficult process in applications such as image enhancement, primarily because there is usually no reference image which can be used as an "ideal" image. The assessment of image quality nevertheless plays an important role in applications such as consumer electronics, where metrics could be used to monitor or optimize image quality in digital cameras, and to benchmark and evaluate image enhancement algorithms.
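As a concrete illustration of why assessing quality is hard, consider the peak signal-to-noise ratio (PSNR), perhaps the most widely used full-reference metric. The sketch below is minimal and assumes Python with NumPy (our choice of tooling, not anything prescribed in this issue); it shows two distortions that earn essentially the same PSNR score yet differ markedly in appearance.

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between a reference and a test image."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak**2 / mse)

# Two distortions with (almost) identical mean-squared error, hence nearly
# identical PSNR, yet very different perceptual character: random grain
# versus a uniform brightness shift.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
noisy = ref + rng.normal(0.0, 5.0, size=ref.shape)  # grainy, structure intact
shifted = ref + 5.0                                 # flat brightness offset

print(f"noisy:   {psnr(ref, noisy):.1f} dB")    # roughly 34 dB
print(f"shifted: {psnr(ref, shifted):.1f} dB")  # roughly 34 dB as well
```

A human observer would judge the two processed images quite differently, which is precisely why a single scalar metric cannot stand in for subjective assessment.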
The purpose of evaluating an algorithm is to understand its behavior in dealing with different categories of images, and/or to help in estimating the best parameters for different applications [3]. Ultimately this may involve some comparison with similar algorithms, in order to rank their performance and provide guidelines for choosing algorithms on the basis of application domain [3]. Assessing the performance of any algorithm in image processing is difficult because performance depends on several factors, as concluded by Heath et al. [4]: (1) the algorithm itself, (2) the nature of the images used to measure the performance of the algorithm, (3) the algorithm parameters used in the evaluation, and (4) the method used for evaluating the algorithm.
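One practical way to keep these four factors visible is simply to record them alongside every reported result. The sketch below is a hypothetical bookkeeping structure of our own devising, not a format proposed by Heath et al.; all names in it are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvaluationRecord:
    """Log of the four factors, so a reported figure can be reproduced."""
    algorithm: str                                  # (1) the algorithm itself
    image_set: str                                  # (2) the images used
    parameters: dict = field(default_factory=dict)  # (3) parameter settings
    method: str = "ground truth comparison"         # (4) the evaluation method

# Hypothetical usage:
# EvaluationRecord("edge-detector-v2", "mammogram-subset-A",
#                  {"sigma": 1.4}, "visual assessment")
```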
The ease with which an algorithm can be evaluated is inversely proportional to the number of parameters it requires. For example, a segmentation algorithm which has no parameters bar the image to be processed will be easier to evaluate than one which has three parameters that must be tailored in order to obtain optimal performance (the sketch following this paragraph makes this burden concrete). The nature of the image itself also impacts performance: evaluation with a set of "easy" images may produce a higher accuracy than the use of more difficult images containing complex regions. There are no rigid guidelines as to exactly how the process of performance evaluation should be characterized; however, there are a number of facets to be considered [5]: testing protocol, testing regime, performance indicators, performance metrics, and image databases.
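Before these facets are taken up in turn, the parameter burden can be made concrete: every added parameter multiplies the number of configurations an evaluation must cover. In the following sketch the grid, the parameter names, and the scoring stub are purely illustrative.

```python
from itertools import product

# Hypothetical parameter grid for a segmentation algorithm: three parameters
# with three candidate values each already demand 27 evaluation runs per
# image, whereas a parameter-free algorithm needs exactly one.
param_grid = {
    "threshold": [0.3, 0.5, 0.7],
    "kernel_size": [3, 5, 7],
    "iterations": [1, 2, 4],
}

def score(image, threshold, kernel_size, iterations) -> float:
    """Stand-in for running the algorithm and scoring the result against
    ground truth; a real harness would return, e.g., an accuracy value."""
    return 0.0

configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(configs))  # 27

# Tuning means searching this space for the best configuration per application:
# best = max(configs, key=lambda cfg: score(test_image, **cfg))
```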
The first of these facets, the testing protocol, relates to the staged approach used to perform testing. There are three basic components: visual assessment, statistical evaluation, and ground truth evaluation. The first stage of performance evaluation involves obtaining a qualitative impression of how well an algorithm has performed. For example, when design begins on a new algorithm, a few sample images may be used in a coarse analysis of the usefulness of existing algorithms by means of visual assessment. Visual assessment usually implies comparing the processed image with the original one. Algorithms judged useful at the first stage are investigated in the next stage as to their accuracy, using quantitative performance metrics and ground truth data (a metric sketch is given after the indicator list below). The "final" stage of evaluation looks at aspects of performance such as robustness, adaptability, and reliability. This process may iterate through a number of cycles.
Next is the testing regime, which relates to the strategy used for selecting test images. There are four basic testing categories. The first of these is exhaustive testing, a brute-force approach whereby an algorithm is presented with every possible image in a database. Such an approach can be overwhelming, and should be limited to the verification component of the design process. Next is boundary value testing, which evaluates a subset of images identified as being representative. The third regime is random testing, in which images are indiscriminately selected; this is a more statistically based process of evaluating an algorithm, providing more realistic conditions. For instance, is it realistic to test a mass detection algorithm on a database of mammograms containing only malignant masses and assume it works accurately? What happens when the algorithm is faced with a normal mammogram: will it mark a feature as a false positive? The final testing regime concerns worst-case testing: what happens when an algorithm processes images containing rare or unusual features?
Performance evaluation relies on the use of performance indicators. Such indicators convey the qualities of an algorithm. They are often loose characterizations used in the specification of an algorithm, and are in themselves difficult to measure. Typical performance indicators include [5]:
(1) accuracy: how well the algorithm has performed with respect to some reference;
(2) robustness: an algorithm's capacity for tolerating various conditions;
(3) sensitivity: how responsive an algorithm is to small changes in features;
(4) adaptability: how the algorithm deals with variability in images;
(5) reliability: the degree to which an algorithm, when repeated using the same stable data, yields the same result;
(6) efficiency: the practical viability of an algorithm, in both time and space.
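To make two of these notions concrete, the following sketch pairs a ground-truth accuracy metric with a simple reliability check. The Dice coefficient is one common choice for segmentation accuracy, named here as an illustration rather than a metric prescribed above, and `segment` stands in for the algorithm under test.

```python
import numpy as np

def dice(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Ground-truth accuracy for binary segmentation masks (1.0 = perfect
    overlap). Dice is one common choice; alternatives such as Jaccard or
    boundary distances probe different qualities of the same result."""
    pred, truth = predicted.astype(bool), ground_truth.astype(bool)
    denom = int(pred.sum()) + int(truth.sum())
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom

def is_reliable(segment, image: np.ndarray, runs: int = 5) -> bool:
    """Reliability indicator: repeated runs on the same stable data
    should yield the same result."""
    results = [segment(image) for _ in range(runs)]
    return all(np.array_equal(results[0], r) for r in results[1:])
```

A complete harness would embed such measures in one of the testing regimes described above, for instance drawing a random sample from the image database rather than processing it exhaustively.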
Finally there is the notion of the image database: which images should be selected to test an algorithm? This relates to the diversity and complexity of the selected images, how many databases are used in the selection process, and the significance of the images to the segmentation task.
The goal of this special issue is to present an overview of current methodologies related to performance evaluation, performance metrics, and failure analysis of image processing algorithms. The first seven papers deal with aspects of performance evaluation in image segmentation, from metrics derived for video object relevance, to skew-tolerance evaluation of page segmentation algorithms and the evaluation of edge detection. The last five papers deal with diverse areas of performance evaluation, including a methodology for designing experiments for performance evaluation and parameter tuning, the verification and validation of fingerprint registration algorithms, and the use of performance measures in feedback. As both consumer and commercial electronics evolve, spanning applications as diverse as food processing, biometrics, medicine, digital photography, and home theatres, it is increasingly essential to provide software which is both accurate and robust. This requires a standardized methodology for testing image processing algorithms, and innovative means of quantifying and automatically resolving issues relating to algorithm functioning. The assessment and characterization of image processing algorithms is an emerging field, which has been growing for the past three decades. We hope that this special issue will direct more energy to the problem of performance evaluation, and revitalize interest in this burgeoning field.