One-class kernel subspace ensemble for medical image classification
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 17 (2014)
Abstract
Classification of medical images is an important issue in computer-assisted diagnosis. In this paper, a classification scheme based on an ensemble of one-class kernel principal component analysis (KPCA) models is proposed for the classification of medical images. The ensemble consists of one-class KPCA models trained using different image features from each image class, and a proposed product combining rule is used to combine the KPCA models and produce classification confidence scores for assigning an image to each class. The effectiveness of the proposed classification scheme was verified using a breast cancer biopsy image dataset and a 3D optical coherence tomography (OCT) retinal image set. The combination of different image features exploits the complementary strengths of the different feature extractors. The proposed classification scheme obtained promising results on the two medical image sets. The proposed method was also evaluated on the UCI breast cancer dataset (diagnostic), and a competitive result was obtained.
1 Introduction
Medical imaging is one of the most important tools in modern medicine; imaging technologies such as X-ray imaging, ultrasonography, biopsy imaging, computed tomography, and optical coherence tomography have been widely used in the clinical diagnosis of various kinds of diseases. However, in clinical applications, it is usually time-consuming to examine an image manually. Moreover, as there is always a subjective element in the pathological examination of an image by a human physician, an automated technique can provide valuable assistance for physicians. A large focus of medical image analysis has been on automated image classification. Many recent studies have revealed that medical images can be properly classified if suitable image feature descriptions are chosen [1–3]. These studies demonstrated that by combining different image description features, it is possible to improve medical image classification performance.
Although classifiers that provide multi-class classification, such as support vector machines (SVM) and neural networks, are usually selected for medical image classification [4], one-class classifiers (OCC) [5], which learn only from the samples seen, are arguably more appropriate for medical image classification tasks. One-class classification is also often called outlier (or novelty) detection, as the learning algorithms are used to differentiate between data that appears normal and data that appears abnormal with respect to the distribution of the training data. This principle of one-class classification is thus appropriate for medical diagnosis and for disease versus no-disease problems.
In many real classification tasks, using a single classifier often fails to capture all aspects of the data. Therefore, a combination of classifiers (an ensemble) is often considered to be an appropriate mechanism to address this shortcoming. The main idea behind the ensemble methodology is to use several classifiers and combine the individual results in order to produce a classification that outperforms the outcome that would have been produced if the classifiers were to operate in isolation [6]. Ensembles of one-class classifiers have also been shown to perform better than individual classifiers [7–9].
There are many strategies for constructing a classifier ensemble, including the use of different training data sets, different feature subsets, various types of individual classifiers, and different fusion rules. Among these, the feature subset strategy has shown better performance when the dimensionality of the feature vector is high compared to the number of data samples [10–13]. The feature subset ensemble strategy is therefore well suited to medical image classification problems: various types of image features are generally extracted for such tasks, so the dimensionality of the feature space typically exceeds the number of image samples and the ‘curse of dimensionality’ arises. Training each base classifier on a feature subset can help to avoid this problem.
In this paper, we propose and evaluate a novel classification scheme for medical images. The proposed classification scheme utilizes an ensemble of one-class classifiers, which is built with the feature subset strategy; each one-class classifier is trained with one type of feature extracted from the training images. The kernel principal component analysis (KPCA) model was chosen as the base classifier of the ensemble. Given an m-class classification task and n different kinds of image features, the ensemble consists of m×n KPCA models. For an unlabeled image, its n types of features are first mapped into the kernel space by the corresponding n trained KPCA models from each class. The mapped features are then reconstructed from the high-dimensional kernel space into the original space by preimage learning, and the distances between the original features and the reconstructed features are measured. The distances given by the KPCA models are combined to output a confidence score describing the probability of the sample belonging to a class. For an m-class classification task, m confidence scores are obtained, one for each class. The image is classified into the class with the maximum confidence score. Promising classification performance was obtained using the proposed classification scheme on two medical image sets.
2 Related works
In this section, we will first introduce some related works on one-class classification. Then one-class classifier ensembles will be discussed.
2.1 One-class classification
Moya et al. originated the term one-class classification [14]. Many approaches to one-class classification have been presented in the literature [5]. Following the taxonomy in the survey papers of [15–17], the algorithms used in OCC can be categorized as follows: (i) boundary methods, (ii) density estimation, and (iii) reconstruction methods.
Tax and Duin [18, 19] sought to solve the problem of OCC by distinguishing the positive class from all other patterns in the pattern space; the positive class data was surrounded by a hyper-sphere which encompassed almost all positive patterns within the minimum radius. This method of support vector data description (SVDD) differs from that proposed by Schölkopf et al. [20] who, using a separating hyper-plane instead of a hyper-sphere, tried to separate the region of the pattern space containing data from the region containing no data. Manevitz and Yousef [21] proposed another version of one-class SVM based on identifying outlier data as representative of the second class; they applied their method to the standard Reuters [22] dataset and noted that their SVM method was quite sensitive to the choice of representation and kernel. Although one-class classifiers, such as OCSVM, have been widely used, the estimated boundary can be sensitive to the nature of the data [23]. This can be highly problematic for many applications, especially for medical diagnosis where the number of false positives must be kept to a minimum, since an accidental diagnosis of a cancer patient as healthy may result in death.
Density estimation methods estimate the density of the target class to form a model with which to represent the data. The generally used models include Parzen, Gaussian, and Gaussian mixture models. The test point is classified by the maximum posterior probability. Generally, this approach works well when the sample size is sufficiently high and a flexible model is used. However, when the model does not fit the data very well, a large bias may result. Details and some comparisons of these methods can be found in [24, 25].
As the density estimation and support-vector-based methods require large training sets, when this is not feasible, one can approximate the target class by a simpler reconstruction model. This type of model tries to capture the data structure; new objects are projected onto the model. The reconstruction error, the difference between the original object x and the projected object p(x), indicates the resemblance of a new object to the original target distribution. When the training data has a very high dimensionality, nearest neighbor methods tend to perform badly [26]. In such cases, it can often be assumed that the target data is distributed in subspaces of much lower dimensionality. Principal component analysis [27] is a linear model that projects the original data into an orthogonal space which captures the variance in the data. Many nonlinear subspace models have also been proposed, such as the self-organizing map (SOM), auto-encoders, auto-associative networks [28], and kernel PCA [29].
2.2 Ensemble of one-class classifiers
Ensemble learning is concerned with mechanisms to combine the results of a number of weak learning systems to produce better learning performance. Several methodologies exist for creating an ensemble classifier from individual classifiers; a survey on the design of multiple classifier systems can be found in [6]. It has been demonstrated that combining classifiers can also be effective for one-class classifiers, and the existing classifier combination strategies can be applied to them. However, because only information concerning one class is available to a one-class classifier, combining one-class classifiers is more difficult. Tax and Duin investigated the influence of feature sets and the types of one-class classifiers on the best choice of the combination rule [30]. A bagging-based one-class support vector machine ensemble method was proposed in [31]. A dynamic ensemble strategy based on structural risk minimization [32] was proposed by Goh et al. for multi-class image annotation [7]. Recently, some research results have revealed that creating a one-class classifier ensemble from different feature subsets can provide better performance. Perdisci et al. [33] used an ensemble of one-class SVMs to create a ‘high-speed payload-based’ anomaly detection system, in which the features were first extracted and clustered and the OCSVM ensemble was then constructed based on the clustered feature subsets. A biometric classification system combining different biometric features was proposed by Bergamini et al. [8], where the one-class SVMs in the ensemble were trained on data from different people. The feature subset strategy provides diversity with respect to the base classifiers, and some researchers emphasize the importance of measuring diversity in ensembles so as to improve classification performance [9, 34].
Combining one-class classifiers has also shown promising performance in medicine and biology [35]. Peng Li et al. [36] proposed a multi-size patch-based classifier ensemble, which provides a multiple-level representation of image content, and this method was evaluated on colonoscopy images and ECG beat detection [37]. The k-nearest neighbor classifier was selected as the base classifier in the work of Okun and Priisalu [38] in which majority voting was chosen as the combination rules for the ensemble and the method was evaluated on gene expression cancer data.
3 One-class kernel subspace ensemble
In this section, the one-class kernel PCA model ensemble will be introduced. The theory of kernel PCA and pattern reconstruction via preimage will first be introduced, then the proposed KPCA ensemble will be described.
3.1 KPCA and pattern reconstruction via preimage
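Traditional (linear) PCA tries to preserve the greatest variations of the data by approximating the data in a principal component subspace spanned by the leading eigenvectors; noise and less important data variations are removed. Kernel PCA inherits this scheme; however, it performs linear PCA in the kernel feature space $\mathcal{H}$. Suppose $\mathcal{X}$ is the original input data space and $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) (also called feature space) associated with a kernel function κ(x,y)=<φ(x),φ(y)>, where $\phi:\mathcal{X}\rightarrow\mathcal{H}$ is the mapping induced by κ. Given a set of patterns $\{x_1,\ldots,x_N\}\subset\mathcal{X}$, kernel PCA performs the traditional linear PCA in $\mathcal{H}$. Similar to linear PCA, KPCA also has the eigendecomposition:

$$HKH = U\Lambda U' \tag{1}$$

where K is the kernel matrix such that $K_{ij}=\kappa(x_i,x_j)$, and

$$H = I - \frac{1}{N}\mathbf{1}\mathbf{1}' \tag{2}$$

is the centering matrix, where I is the N×N identity matrix, $\mathbf{1}=[1,1,\ldots,1]'$ is an N×1 vector, $U=[\alpha_1,\ldots,\alpha_N]$ is the matrix containing the eigenvectors $\alpha_i=[\alpha_{i1},\ldots,\alpha_{iN}]'$, and $\Lambda=\mathrm{diag}(\lambda_1,\ldots,\lambda_N)$ contains the corresponding eigenvalues.
Denote the mean of the φ-mapped patterns by $\bar{\phi}=\frac{1}{N}\sum_{i=1}^{N}\phi(x_i)$. Then for a mapped pattern $\phi(x_i)$, the centered map can be defined as follows:

$$\tilde{\phi}(x_i) = \phi(x_i) - \bar{\phi} \tag{3}$$

The k th eigenvector $V_k$ of the covariance matrix in the feature space is a linear combination of the $\tilde{\phi}(x_i)$:

$$V_k = \sum_{i=1}^{N}\frac{\alpha_{ki}}{\sqrt{\lambda_k}}\,\tilde{\phi}(x_i) = \frac{1}{\sqrt{\lambda_k}}\,\tilde{\Phi}\alpha_k \tag{4}$$

where $\tilde{\Phi}=[\tilde{\phi}(x_1),\ldots,\tilde{\phi}(x_N)]$. If we use $\beta_k$ to denote the projection of the φ-image of a pattern x onto the k th component $V_k$, then:

$$\beta_k = \tilde{\phi}(x)'V_k = \frac{1}{\sqrt{\lambda_k}}\sum_{i=1}^{N}\alpha_{ki}\,\tilde{\kappa}(x,x_i) \tag{5}$$

where

$$\tilde{\kappa}(x,y) = \tilde{\phi}(x)'\tilde{\phi}(y) = \kappa(x,y) - \frac{1}{N}\mathbf{1}'k_x - \frac{1}{N}\mathbf{1}'k_y + \frac{1}{N^2}\mathbf{1}'K\mathbf{1} \tag{6}$$

where $k_x=[\kappa(x,x_1),\ldots,\kappa(x,x_N)]'$. Denote

$$\tilde{k}_x = [\tilde{\kappa}(x,x_1),\ldots,\tilde{\kappa}(x,x_N)]' = H\left(k_x - \frac{1}{N}K\mathbf{1}\right) \tag{7}$$

then $\beta_k$ in Equation 5 can be rewritten as $\beta_k=\frac{1}{\sqrt{\lambda_k}}\,\alpha_k'\tilde{k}_x$.
Therefore, the projection P(φ(x)) of φ(x) onto the subspace spanned by the first M eigenvectors can be obtained by:

$$P(\phi(x)) = \sum_{k=1}^{M}\beta_k V_k + \bar{\phi} = \tilde{\Phi}A\,\tilde{k}_x + \bar{\phi} \tag{8}$$

where $A=\sum_{k=1}^{M}\frac{1}{\lambda_k}\alpha_k\alpha_k'$.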
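The projection step can be illustrated with a short NumPy sketch. It assumes the Gaussian kernel used later in the experiments and the standard centered-kernel form of the projection coefficients; the function names (`gaussian_kernel`, `fit_kpca`, `project`) and the dictionary layout are illustrative choices, not part of the original MATLAB implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_kpca(X, sigma, n_components):
    """Eigendecompose the centered kernel matrix HKH and keep the leading components."""
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    lam, U = np.linalg.eigh(H @ K @ H)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:n_components]   # indices of the largest eigenvalues
    return {"X": X, "K": K, "H": H, "sigma": sigma,
            "lam": lam[order], "alpha": U[:, order]}

def project(model, x):
    """Projection coefficients beta_k = alpha_k' k_tilde_x / sqrt(lambda_k)."""
    N = model["X"].shape[0]
    k_x = gaussian_kernel(x[None, :], model["X"], model["sigma"]).ravel()
    k_tilde = model["H"] @ (k_x - model["K"] @ np.ones(N) / N)
    return (model["alpha"].T @ k_tilde) / np.sqrt(model["lam"])
```

For a training pattern x_i, the coefficients reduce to sqrt(λ_k)·α_ki, which gives a convenient sanity check on an implementation.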
PCA is a simple method whereby a model for the distribution of training data can be generated. For linear distributions, PCA can be used; however, many real-world problems are nonlinear. Methods like Gaussian mixture models and auto-associative neural networks have been used for nonlinear problems. These methods, however, need to solve a nonlinear optimization problem and are thus prone to local minima and sensitive to the initialization [29]. KPCA runs PCA in the high-dimensional feature space through the nonlinearity of the kernel, and this allows for a refinement in the description of the patterns of interest. Therefore, kernel PCA was chosen to model the nonlinear distribution of the training samples here.
Kernel PCA has been widely used for classification tasks. A straightforward method using kernel PCA for classification is to directly use the distances between the mapped patterns in the feature space to obtain the classification boundaries [29, 39]. However, as pointed out in [29] for kernel PCA, their experimental results showed that the classification performance highly depends on the parameters selected for the kernel function, and there is no guideline for parameter selection in real classification tasks. It is also demonstrated in a more recent work that it is not sufficient to use kernel space distance for unsupervised learning algorithms, and the distances in the input space are more appropriate for classification [40].
In this paper, we focus on the distances between a pattern x and its reconstructions by the kernel PCA models trained from different classes. Kernel PCA is used as a one-class classifier here, which means that for each class, at least one KPCA model is trained. Suppose there is an m-class classification task; there will be m KPCA models, one for each class. Given an unlabeled pattern x, every KPCA model will produce a projection P(φ(x)) i , i=1,…,m. During classification, x is reconstructed in the input space from every P(φ(x)) i , so that m reconstruction results $\hat{x}_i$ are obtained. The distance between x and each $\hat{x}_i$ (also called the reconstruction error) is calculated, and x is assigned to the class whose KPCA model produces the minimum reconstruction error. Ideally, the KPCA model trained from the class to which x belongs will always give the minimum reconstruction error. In our proposed classification scheme, multiple KPCA models are trained for each class and the reconstruction errors of KPCA models from different classes are combined for classification, as described in Section 3.2 and Section 3.3.
In order to obtain the input-space distance between x and its reconstruction result, it is necessary to map P(φ(x)) back into the input space. The reverse mapping from feature space back to input space is called the preimage problem (Figure 1). However, the preimage problem is ill-posed and an exact preimage x′ of P(φ(x)) in the input space generally does not exist [41]; instead, one can only find an approximation $\hat{x}$ in the input space such that

$$\phi(\hat{x}) \simeq P(\phi(x)) \tag{9}$$
In order to address the preimage learning problem, some algorithms have been proposed. Mika et al. [41] proposed an iterative method to determine the preimage by minimizing least square distance error. Kwok and Tsang proposed a distance constraint learning (DCL) method to find preimage by using a similar technique in multi-dimensional scaling (MDS) [42]. In a more recent work, Zheng et al. [43] proposed a weakly supervised penalty strategy for preimage learning in KPCA; however, their method needs information for both positive and negative classes. As we are only interested in one-class scenarios, the distance constraint method in [42] was selected with respect to the work described in this paper. We briefly review the method here.
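For any two patterns x i and x j in the input space, the Euclidean distance $d_{ij}=d(x_i,x_j)$ can be easily obtained. Similarly, the feature-space distance $\tilde{d}_{ij}=\tilde{d}(\phi(x_i),\phi(x_j))$ between their φ-mapped images can also be obtained. For many commonly used kernels, such as the Gaussian kernel, the kernel value depends only on the input-space distance, $\kappa(x_i,x_j)=\kappa(d_{ij}^2)$, and there is a simple relationship between the feature-space distance and the input-space distance [44]:

$$\tilde{d}_{ij}^2 = K_{ii} + K_{jj} - 2\kappa(d_{ij}^2) \tag{10}$$

Therefore,

$$\kappa(d_{ij}^2) = \frac{1}{2}\left(K_{ii} + K_{jj} - \tilde{d}_{ij}^2\right) \tag{11}$$

As κ is invertible, $d_{ij}^2$ can be obtained if $\tilde{d}_{ij}^2$ is known.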
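As a worked instance of this inversion, consider the Gaussian kernel k(x,y)=exp(−‖x−y‖²/(2σ²)) used in the experiments, for which K_ii = K_jj = 1. The squared feature-space distance is then 2 − 2·exp(−d²/(2σ²)), and solving for d² gives d² = −2σ²·log(1 − d̃²/2). The sketch below implements both directions; the function names are illustrative.

```python
import numpy as np

def input_to_feature_sqdist(d_in_sq, sigma):
    """Squared feature-space distance for a Gaussian kernel (K_ii = K_jj = 1)."""
    d_in_sq = np.asarray(d_in_sq, dtype=float)
    return 2.0 - 2.0 * np.exp(-d_in_sq / (2.0 * sigma ** 2))

def feature_to_input_sqdist(d_feat_sq, sigma):
    """Invert the Gaussian kernel to recover the squared input-space distance."""
    kappa = 1.0 - 0.5 * np.asarray(d_feat_sq, dtype=float)  # kernel value implied by the distance
    return -2.0 * sigma ** 2 * np.log(kappa)
```

The inverse is only defined for squared feature-space distances below 2, which always holds for two mapped points under this kernel.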
A given training set has n patterns X={x1,…,x n }. For a pattern x in the input space, the corresponding φ(x) is projected to P(φ(x)); then for each training pattern x i in X, P(φ(x)) will be at a certain distance from φ(x i ) in the feature space. This feature-space distance can be obtained by:

$$\tilde{d}^2\left(P(\phi(x)),\phi(x_i)\right) = \|P(\phi(x))\|^2 + \|\phi(x_i)\|^2 - 2P(\phi(x))'\phi(x_i) \tag{12}$$

Equation 12 can be evaluated by using Equations 5 and 8. Using Equation 11, the feature-space distances between P(φ(x)) and each φ(x i ) can then be converted into the corresponding input-space distances. Denote these squared distances as:

$$d^2 = [d_1^2, d_2^2, \ldots, d_n^2]' \tag{13}$$

The location of $\hat{x}$ will be obtained by requiring $d^2(\hat{x},x_i)$ to be as close to the values in (13) as possible, i.e.,

$$d^2(\hat{x},x_i) \simeq d_i^2, \quad i=1,\ldots,n \tag{14}$$

To this end, in DCL, the training set X is constrained to the n nearest neighbors of x, and least squares optimization is used to obtain $\hat{x}$.
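The final least-squares step can be sketched as follows. This is a minimal illustration of the closed-form solution described by Kwok and Tsang [42], assuming the squared input-space distances from the sought preimage to the chosen neighbors have already been computed; the function name `dcl_preimage` and the tolerance parameter are illustrative.

```python
import numpy as np

def dcl_preimage(neighbors, d_sq, tol=1e-10):
    """Locate a point whose squared distances to `neighbors` best match `d_sq`.

    neighbors: (n, d) array of neighbor patterns in the input space.
    d_sq:      (n,) array of target squared input-space distances.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    d_sq = np.asarray(d_sq, dtype=float)
    mean = neighbors.mean(axis=0)
    Xc = (neighbors - mean).T                        # d x n, columns centered at the neighbor mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s.max()                         # drop numerically null directions
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    Z = s[:, None] * Vt                              # neighbor coordinates in the SVD basis
    d0_sq = (Z ** 2).sum(axis=0)                     # squared distances of neighbors to their mean
    z_hat = -0.5 * (Vt @ (d_sq - d0_sq)) / s         # least-squares solution in the SVD basis
    return U @ z_hat + mean                          # map back to the input space
```

When the target distances are exactly consistent with a point in the span of the neighbors, this procedure recovers that point; with noisy distances it returns the least-squares compromise.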
3.2 Construction of one-class KPCA ensemble for image classification
Given an image set of m classes, the proposed one-class KPCA ensemble is built as follows: (i) for each image category, n types of image features are extracted; (ii) a KPCA model is trained for each individual type of the extracted features; and (iii) as a result, n KPCA models are constructed for each image class. For an m-class problem, there will be m×n KPCA models in the ensemble. The construction of the proposed one-class KPCA ensemble is illustrated in Figure 2, where $\mathrm{KPCA}_{ij}$ represents the model trained by the type j feature from class i.
3.3 Multi-class prediction using an ensemble of one-class KPCA models
The classification confidence score is used to describe the probability that the image belongs to each class. The confidence score provides a quantitative measure of the predictions produced by the KPCA models.
Given an unlabeled image x with n extracted features F={f1,f2,…,f n }, let $\mathrm{KPCA}_{ij}$ represent the KPCA model belonging to class i and trained from the feature set f j , where i∈{1…m} is the class label and j∈{1…n} is the feature label. For classification, the preimages of each image feature f j ∈F are obtained by all the KPCA models trained from the j th feature. The DCL method introduced in Section 3.1 is used for obtaining the preimages. For example, the preimages of f1 are obtained by the models $\mathrm{KPCA}_{11},\ldots,\mathrm{KPCA}_{m1}$. Denote the preimages of f j as $f_j'=\{f_{j1}',\ldots,f_{jm}'\}$; the squared distance D j between f j and f j′ is used as the reconstruction error, therefore:

$$D_j = [d_{j1}, d_{j2}, \ldots, d_{jm}] \tag{15}$$

where $d_{ji}=\|f_j-f_{ji}'\|^2$, i=1,…,m. In the same way, the preimages of all the features in F are obtained, forming a distance matrix D of dimensions n×m, where n is the number of feature types (and hence the number of KPCA models per class used for the preimage learning) and m is the number of image classes. Each row of D contains the reconstruction errors of one feature in F under the m KPCA models, one from each class:

$$D = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1m} \\ d_{21} & d_{22} & \cdots & d_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & d_{nm} \end{bmatrix} \tag{16}$$
Note that the values in each column of D represent the reconstruction errors of F using the KPCA models from the same class; these values provide a measurement of how well an image x is described by the KPCA models from one class. Since we try to find the KPCA models from the class which gives the minimum reconstruction error, this is in effect a 1-nearest neighbor search, as we wish to find the best preimage of x among m preimages. Such a distance measure can improve the speed of the classification. Moreover, it is also in line with the ideas in metric multi-dimensional scaling, in which smaller dissimilarities are given more weight, and in locally linear embedding, where only the local neighborhood structure needs to be preserved [42].
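The layout of the ensemble and of the n×m distance matrix can be sketched in a few lines. To keep the example short and self-contained, linear PCA reconstruction error is substituted for the KPCA-plus-preimage reconstruction described above; this is a deliberate simplification, and all function names are illustrative.

```python
import numpy as np

def fit_pca(F, n_components):
    """Stand-in one-class model: mean and leading principal directions of F."""
    mean = F.mean(axis=0)
    _, _, Vt = np.linalg.svd(F - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(model, f):
    """Squared distance between f and its reconstruction from the subspace."""
    mean, V = model
    r = f - mean
    return float(np.sum((r - V.T @ (V @ r)) ** 2))

def build_ensemble(features, n_components):
    """features[i][j]: (samples x dims) array of type-j features from class i."""
    return [[fit_pca(F, n_components) for F in per_class] for per_class in features]

def distance_matrix(ensemble, f_list):
    """D[j, i]: reconstruction error of feature j under the class-i model."""
    m, n = len(ensemble), len(f_list)
    D = np.empty((n, m))
    for i in range(m):
        for j in range(n):
            D[j, i] = reconstruction_error(ensemble[i][j], f_list[j])
    return D
```

Each column of the returned matrix collects the errors from one class, so a column with uniformly small entries indicates a class whose models describe the sample well.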
In order to combine the reconstruction errors from the KPCA models, the reconstruction errors $d_{ji}$ in D are first normalized using Equation 17:

$$\bar{d}_{ji} = \exp\left(-\frac{d_{ji}}{s}\right) \tag{17}$$

which models a Gaussian distribution from the squared distance. The scale parameter s can be fitted to the distribution of $d_{ji}$. Moreover, Equation 17 has the property that the scaled value is always bounded between 0 and 1. The normalized distance matrix D is denoted by $\bar{D}$.
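A minimal sketch of this normalization is given below, assuming it takes the exponential form exp(−d/s) applied elementwise to the squared reconstruction errors. How the scale s is fitted is not specified in the text; using the mean of the observed errors is only one plausible choice and is labeled as such in the code.

```python
import numpy as np

def normalize_errors(D, s=None):
    """Map squared reconstruction errors to resemblance values in (0, 1]."""
    D = np.asarray(D, dtype=float)
    if s is None:
        s = D.mean()  # assumption: scale set from the mean observed error
    return np.exp(-D / s)
```

A zero error maps to 1 and larger errors map to smaller values, so after this step a larger entry means that the corresponding model describes the sample better.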
The normalized reconstruction errors in the normalized distance matrix $\bar{D}$ are obtained by different one-class KPCA models and can be combined to produce the confidence scores (CS) for classifying x into each class. Let $CS=\{cs_1,cs_2,\ldots,cs_m\}$ denote the confidence scores for x with respect to each image class. The confidence scores can be computed from the distance matrix by using an appropriate combination rule. A product rule was proposed in [45] for combining one-class classifiers:

$$cs_k = \frac{P_k(x|w_T)}{P_k(x|w_T) + P_k(x|w_O)} \tag{18}$$

where k is the label of the target class. $P_k(x|w_T)$ is the probability of classifying x into the target class obtained from the classifiers of class k, which can be calculated from the values in one column of the normalized distance matrix as:

$$P_k(x|w_T) = \prod_{j=1}^{n}\bar{d}_{jk} \tag{19}$$

$P_k(x|w_O)$ represents the probability of x belonging to the outlier class, which is obtained by multiplying all the values in $\bar{D}$ except the values from the ‘target’ class k:

$$P_k(x|w_O) = \prod_{j=1}^{n}\;\prod_{i=1,\,i\neq k}^{m}\bar{d}_{ji} \tag{20}$$
In [30], the authors investigated different mechanisms for combining one-class classifiers, and their results showed that the ‘product rule’ in Equation 18 outperforms other combining mechanisms for one-class classifiers. As noted in [30, 45], when using the product combining rule, $P_k(x|w_T)$ should be available, and a distance should be transformed to a ‘resemblance’ by some heuristic mapping as in Equation 17.
However, when one-class classifiers are used for multi-class classification tasks, the product rule in Equation 18 may not perform well. The number of one-class classifiers constructed for the outlier classes exceeds the number of classifiers for the target class; a problem of ‘imbalance’ thus occurs in Equation 18, where many more terms are used to produce $P_k(x|w_O)$ than to produce $P_k(x|w_T)$. During classification, some classifiers from the outlier classes may give small classification probabilities when the classifiers estimate that the pattern is not an outlier. In Equation 18, these small probabilities are still used to calculate $P_k(x|w_O)$, even if more classifiers reach a different judgement. In this imbalanced situation, those relatively small probabilities drive $P_k(x|w_O)$ toward 0, which makes the classification confidence scores rather close to each other.
Here, a variant of the product combining rule of Equation 18 is proposed to address the imbalance problem. Instead of using the mapping values from all the outlier classes’ KPCA models, for those models trained on the same type of image feature, only the model that gives the biggest mapping value is chosen to produce the outlier probability. The proposed product combining rule can be described as:

$$cs_k = \frac{P_k(x|w_T)}{P_k(x|w_T) + \prod_{j=1}^{n}P_k^{j}(x|w_O)} \tag{21}$$

where j is the image feature label and j=1…n. $P_k(x|w_T)$ can be obtained using Equation 19. Each $P_k^{j}(x|w_O)$ in the product is the probability that x belongs to the outlier classes according to the j th image feature, which can be obtained by:

$$P_k^{j}(x|w_O) = \max_{i\neq k}\,\bar{d}_{ji} \tag{22}$$
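The difference between the two combining rules can be seen in a small numerical sketch. It assumes an n×m matrix `R` of normalized resemblance values (rows are feature types, columns are classes) and at least two classes; the function names and the example values are illustrative and are not taken from the experiments.

```python
import numpy as np

def product_rule(R, k):
    """Original product rule: the outlier term multiplies every entry outside column k."""
    target = np.prod(R[:, k])
    outlier = np.prod(np.delete(R, k, axis=1))
    return target / (target + outlier)

def proposed_rule(R, k):
    """Variant: per feature (row), keep only the largest outlier-class value."""
    target = np.prod(R[:, k])
    outlier = np.prod(np.delete(R, k, axis=1).max(axis=1))
    return target / (target + outlier)

def classify(R, rule):
    """Treat each class as the target in turn and return (best class, all scores)."""
    scores = np.array([rule(R, k) for k in range(R.shape[1])])
    return int(np.argmax(scores)), scores

# Hypothetical resemblance values for 3 feature types and 3 classes.
R = np.array([[0.9, 0.5, 0.4],
              [0.8, 0.6, 0.3],
              [0.7, 0.5, 0.2]])
label_old, scores_old = classify(R, product_rule)
label_new, scores_new = classify(R, proposed_rule)
```

In this example both rules select the first class, but under the original rule the two best scores are both above 0.9, whereas under the variant the gap between them is much larger, because the outlier term uses the same number of factors as the target term.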
The maximum value selection procedure in Equation 21 is illustrated by a simple example in Figure 3. In Figure 3, there is a four-class classification task (I, II, III, and IV in the figure), in which four types of features are extracted from image x. For one type of image feature, there are four trained KPCA models, each from a different class, giving four reconstruction results for the same feature of x (one row in the normalized distance matrix $\bar{D}$). If we consider class I as the ‘target’ class (first column in the figure), the four values in the first column are used to produce the term $P_k(x|w_T)$ in Equation 21. The other three columns of values are deemed to be the outlier probabilities produced by the KPCA models from the other three classes. The proposed combining rule selects the maximum mapping value from each row to produce the outlier probability product $\prod_{j}P_k^{j}(x|w_O)$.
The selection scheme in Equation 21 ensures that the numbers of terms used for calculating $P_k(x|w_T)$ and the outlier probability product are the same. Moreover, the negative effect of the imbalance on the confidence scores is removed. The proposed combining rule is in line with the basic idea of one-class classification, as in the one-class scenario, one only needs to know whether a pattern should be assigned to the target class or to the outlier class. If one or more outlier models are able to produce a high outlier probability product, the current target class should be doubted. Moreover, by combining the outlier values from different feature-derived models, the diversity of the ensemble is improved, which is an important factor in making an ensemble learning method successful [46].
Note that since the ‘target class’ is unknown for an unlabeled image, during classification, each class is deemed to be the target class in turn to calculate the confidence score, i.e., each column of the normalized distance matrix $\bar{D}$ is used in turn to obtain the confidence score $cs_k$ for each class. In this way, for an m-class classification, each class is deemed to be the target class, one by one, to produce m confidence scores; the image is then assigned to the class giving the maximum classification confidence score.
4 Experiments and results
The effectiveness of the proposed method is illustrated using a biopsy breast cancer image set, a 3D OCT retinal image set, and the UCI Wisconsin breast cancer (diagnostic) dataset. The details of the image sets and image feature extractors are given in Section 4.1. Section 4.2 introduces our experimental setup and the evaluation methods used in our experiments. The effectiveness of combining kernel PCAs is illustrated in Section 4.3. Finally, the proposed method was compared with some state-of-the-art ensemble classification methods on the UCI Wisconsin breast cancer dataset.
4.1 Image set and feature extraction
Three medical datasets were used to evaluate the proposed classification method: a benchmark breast cancer biopsy image dataset from the Israel Institute of Technology [47], a 3D OCT retinal image set, and the breast cancer dataset (diagnostic) from the UCI machine learning repository [48].
4.1.1 Breast cancer biopsy image set
The image set consists of 361 samples, of which 119 were classified by a pathologist as normal tissue, 102 as carcinoma in situ, and 140 as invasive ductal or lobular carcinoma. The samples were generated from breast tissue biopsy slides, stained with hematoxylin and eosin. They were photographed using a Nikon Coolpix® 995 attached to a Nikon Eclipse® E600 (Nikon Corporation, Shinjuku, Tokyo, Japan) at a magnification of ×40 to produce images with a resolution of about 5 μm per pixel. No calibration was made, and the camera was set to automatic exposure. The images were cropped to a region of interest of 760×570 pixels and compressed using lossy JPEG compression. The resulting images were again inspected by a pathologist to ensure that their quality was sufficient for diagnosis. Figure 4 presents three sample images of healthy tissue, tumor in situ, and invasive carcinoma.
The shape feature and texture feature are critical factors for distinguishing one image from another. For the biopsy image discrimination, shapes and textures are also effective. As we can see from Figure 4, the three kinds of biopsy images have visible differences in cell externality and texture distribution. Thus, we use completed local binary patterns (CLBPs) [49] for extracting local textural features, gray level co-occurrence matrix (GLCM) [50] statistics for representing global textures, and the curvelet transform [51] for shape description. These feature descriptors have shown promising results in our previous work on biopsy image classification [52].
Different from traditional LBPs, in CLBPs a local region is represented by three coding operators to represent the central pixel, the difference signs, and the difference magnitudes [49]. According to the authors, CLBP can achieve much better rotation invariant texture classification results than conventional LBP-based schemes. In this paper, we use the 3D joint histogram of these three operators to generate textural features of breast cancer biopsy images, and the joint combination of the three components gives better classification than when using conventional LBPs and provides a smaller feature dimension. The dimension of the CLBP feature is 200.
The co-occurrence probabilities provide a second-order method for generating texture features. The basis for features used here is the gray level co-occurrence matrix [50]. With respect to the work described in this paper, a total of 22 features were extracted from gray level co-occurrence matrix, and they are listed in Table 1. Each of these statistics has a qualitative meaning with respect to the structure within the gray level co-occurrence matrix. The total dimension of the GLCM features is 220.
The fastest curvelet transform currently available is the curvelets via wrapping [51], which was therefore adopted with respect to our work. From the curvelet coefficients, some statistics can be calculated from each of these curvelet sub-bands. In this paper, the mean μ, the standard deviation δ, and the entropy H are used as the simple features. If n curvelets are used for the transform, 3n features G=[G μ ,G δ ,H] are obtained, where G μ =[μ1,μ2,…,μ n ], G δ =[δ1,δ2,…,δ n ], and H=[h1,h2,…,h n ]. A 3n-dimensional feature vector can be used to represent each image in the dataset. Using five levels of the curvelet transform, 82 sub-bands of curvelet coefficients are computed, therefore, a 246 dimensional curvelet feature vector is generated for each image.
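The assembly of the 3n-dimensional curvelet feature vector can be sketched as follows, assuming the sub-band coefficient arrays have already been produced by a curvelet transform. The text does not specify how the entropy is computed or whether coefficient magnitudes are used; the histogram-based Shannon entropy over magnitudes below is an assumption, and the function name is illustrative.

```python
import numpy as np

def subband_stats(subbands, bins=32):
    """Concatenate per-sub-band means, standard deviations, and entropies."""
    means, stds, ents = [], [], []
    for c in subbands:
        mag = np.abs(np.asarray(c)).ravel()        # assumption: statistics of magnitudes
        means.append(mag.mean())
        stds.append(mag.std())
        hist, _ = np.histogram(mag, bins=bins)     # assumption: histogram-based entropy
        p = hist[hist > 0] / hist.sum()
        ents.append(-(p * np.log2(p)).sum())
    return np.concatenate([means, stds, ents])     # layout [G_mu, G_delta, H]
```

With 82 sub-bands this yields a 246-dimensional vector, matching the dimension stated above.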
4.1.2 3D OCT retinal image set
The 3D OCT retinal image set was collected at the Royal Hospital of University of Liverpool. The image set contains 140 volumetric OCT images, in which 68 images are from normal eyes and the remainder from eyes that have age-related macular degeneration (AMD). Figure 5 shows the example images.
The OCT images are preprocessed by using the Split Bregman Isotropic Total Variation algorithm with a least squares approach [53]. The preprocessing step has two targets: (i) identification and extraction of a volume of interest (VOI) which also results in noise removal and (ii) flattening of the retina as appropriate. The example images after preprocessing can be seen in Figure 6.
As the images are three-dimensional, following the work in [53], three types of image features were used for image description: local binary patterns of three orthogonal planes (LBP-TOP), local phase quantization (LPQ), and multi-scale spatial pyramid (MSSP).
4.1.3 UCI breast cancer dataset
The Wisconsin breast cancer image sets were obtained from digitized images of fine needle aspirate (FNA) of breast masses. They describe characteristics of the cell nuclei present in the image. Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. The 569 images in the dataset are categorized into two classes: benign and malignant.
4.2 Experimental setup and performance evaluation methods
MATLAB 7 was used to implement the proposed method, together with the Gaussian kernel $k(x,y)=\exp\left(-\|x-y\|^2/(2\sigma^2)\right)$. Other types of kernels could have been used; however, since the Gaussian kernel is commonly used for kernel PCA, SVDD, and Parzen density estimation, it is the only kernel used in the experiments reported here.
Unless otherwise stated, tenfold cross-validation was used, and all results are averages over ten runs of the tenfold cross-validation. The following measures are used to evaluate the proposed method:
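As a small illustration of the Gaussian kernel, the sketch below computes a kernel matrix between two sets of feature vectors with NumPy. It is written in Python rather than the MATLAB used by the authors, the function name is hypothetical, and the default σ = 4 is simply the value reported for the KPCA models in Section 4.3.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=4.0):
    """Kernel matrix with entries k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    sq = np.maximum(sq, 0.0)  # clip tiny negative values caused by rounding
    return np.exp(-sq / (2.0 * sigma ** 2))

K = gaussian_kernel([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0], [3.0, 4.0]])
print(K)  # diagonal entries are 1; off-diagonal entries equal exp(-25/32)
```

The off-diagonal value follows from ‖x−y‖² = 3² + 4² = 25 and 2σ² = 32.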
- Recognition rate (RR) = number of correctly recognized images / number of testing images
- ROC, receiver operating characteristic graph
- AUC, area under an ROC curve
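The evaluation protocol (ten runs of tenfold cross-validation, scored by recognition rate) can be sketched as below. This is an illustrative sketch only: the helper names are hypothetical, the folds are not stratified (the text does not say whether the authors stratified), and a toy nearest-class-mean classifier on synthetic data stands in for the proposed ensemble.

```python
import numpy as np

def recognition_rate(y_true, y_pred):
    """RR = number of correctly recognized images / number of testing images."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def repeated_kfold(n_samples, n_folds=10, n_runs=10, seed=0):
    """Yield (train_idx, test_idx) for n_runs independent shuffles of n_folds-fold CV."""
    rng = np.random.default_rng(seed)
    for _ in range(n_runs):
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for k in range(n_folds):
            yield np.concatenate(folds[:k] + folds[k + 1:]), folds[k]

def cross_validated_rr(fit_predict, X, y, **cv_kwargs):
    """Average recognition rate over all folds of all runs."""
    X, y = np.asarray(X), np.asarray(y)
    rates = [recognition_rate(y[te], fit_predict(X[tr], y[tr], X[te]))
             for tr, te in repeated_kfold(len(y), **cv_kwargs)]
    return float(np.mean(rates))

def nearest_mean(X_train, y_train, X_test):
    """Toy stand-in classifier: assign each test point to the closest class mean."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

# Synthetic two-class data with the same class sizes as the OCT set (68 and 72).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (68, 5)), rng.normal(3.0, 1.0, (72, 5))])
y = np.array([0] * 68 + [1] * 72)
rr = cross_validated_rr(nearest_mean, X, y)
print(f"average recognition rate: {rr:.4f}")
```

Any classifier can be plugged in by passing a function with the same `fit_predict(X_train, y_train, X_test)` shape.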
4.3 Evaluation of kernel PCA ensemble
The KPCA ensemble evaluation using the biopsy image data and the 3D OCT retinal image data is reported in this section. For the biopsy images, as introduced in Section 4.1, three types of image features were extracted; therefore, for each image class, three kernel PCA models were built, one for each type of image feature. The recognition rates obtained using these KPCAs individually are listed in columns 2 to 4 of Table 2, where CvletK, GLCMK, and LBPK represent KPCA models trained from curvelet, GLCM, and LBP features, respectively. The results of combining all KPCA models are listed in the last column of Table 2; the combining rule is introduced in Equation 21. The parameters of the KPCAs were set to σ=4 and n=40. The combined model gives the best classification performance for each image class; the average classification accuracy over the three image classes is 92.28%.
The evaluation results on the 3D OCT retinal images are listed in Table 3. Three types of image features were extracted, namely LPQ, LBP-TOP, and MSSP. Therefore, for each image class, three kernel PCA models were built, one for each type of image feature. The recognition rates obtained using these KPCAs individually are listed in columns 2 to 4 of Table 3, where LPQK, LBPK, and MSSPK represent the KPCA models trained from LPQ, LBP-TOP, and MSSP, respectively. The results of combining all KPCA models are listed in the last column of Table 3. The parameters of the KPCAs were set to σ=4 and n=40. The combined model gives the best classification performance for each image class; the average classification accuracy over the two image classes is 92.06%.
From Tables 2 and 3, one can see that with the proposed product combining rule, the classification accuracies of all the image classes have improved. This illustrates that combining one-class classifiers trained from different features can improve classification performance, which is in accordance with the observation in [30]. For comparison, other one-class classifiers were also used as the base classifier of the ensemble with the same combining rule; the classification results on the biopsy image set and the 3D OCT retinal image set are listed in Tables 4 and 5, respectively.
To compare a variety of one-class classifiers, six were used as the base classifier for the ensemble: Parzen, SVDD, PCA, Kmeans, MoG, and KPCA. The receiver operating characteristic (ROC) curves obtained using the different one-class classifiers on the biopsy image data are shown in Figure 7. The x axis of the ROC curves is the false positive rate (FPr) and the y axis is the true positive rate (TPr). The FPr and TPr are obtained by Equations 23 and 24, respectively. A threshold on the difference between the largest and the second largest confidence scores was used to obtain the trade-off between TPr and FPr. Initially, the threshold was set to 0.05; it was then increased in steps of 0.01 up to 0.60, and the TPr and FPr were computed at each threshold value. The areas under the ROC curves (AUC) for the compared classifiers are listed in Table 6; the KPCA ensemble gives the best result.
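To make the role of a single one-class KPCA model more concrete, the sketch below fits kernel PCA to the samples of one class and scores new samples by their reconstruction error in feature space, following the formulation of Hoffmann's kernel PCA novelty detection (listed in the references). Whether this matches the paper's own equations term by term cannot be checked from this section, so it should be read as an assumption. The class name is hypothetical, n is interpreted as the number of retained principal components, and the demo uses fewer components and synthetic data to stay small.

```python
import numpy as np

def rbf(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

class OneClassKPCA:
    """Kernel PCA fitted on one class; scores samples by feature-space reconstruction error."""

    def __init__(self, sigma=4.0, n_components=40):
        self.sigma, self.n_components = sigma, n_components

    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        K = rbf(self.X, self.X, self.sigma)
        self.col_mean, self.all_mean = K.mean(axis=0), K.mean()
        # Centre the kernel matrix in feature space.
        Kc = K - self.col_mean[None, :] - self.col_mean[:, None] + self.all_mean
        vals, vecs = np.linalg.eigh(Kc)
        top = np.argsort(vals)[::-1][: self.n_components]
        vals, vecs = vals[top], vecs[:, top]
        keep = vals > 1e-10                      # drop numerically null directions
        # Scale so that each principal axis has unit norm in feature space.
        self.alphas = vecs[:, keep] / np.sqrt(vals[keep])
        return self

    def reconstruction_error(self, Z):
        Kz = rbf(np.asarray(Z, dtype=float), self.X, self.sigma)
        row_mean = Kz.mean(axis=1)
        Kzc = Kz - row_mean[:, None] - self.col_mean[None, :] + self.all_mean
        kzz = 1.0 - 2.0 * row_mean + self.all_mean   # centred k(z, z); k(z, z) = 1 for the Gaussian kernel
        proj = Kzc @ self.alphas                      # projections onto the principal axes
        return np.maximum(kzz - (proj ** 2).sum(axis=1), 0.0)

# Synthetic demo: samples far from the training class get a larger error.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
model = OneClassKPCA(sigma=4.0, n_components=10).fit(X_train)
err_train = model.reconstruction_error(X_train)
err_far = model.reconstruction_error(np.full((1, 3), 20.0))
print(err_train.max(), err_far[0])
```

In an ensemble of the kind described here, one such model would be fitted per image class and per feature type, and a low error would indicate that a sample resembles that class under that feature.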
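The idea of a product combining rule can be illustrated with the sketch below. Equation 21 is not reproduced in this section, so this is not the authors' rule: it assumes that each KPCA model yields a reconstruction error, that the error is mapped to a confidence by exp(−error/scale), and that per-class confidences are multiplied across feature types and then normalised. The function name, the mapping, and the `scale` parameter are all hypothetical.

```python
import numpy as np

def product_combine(errors, scale=1.0):
    """Combine per-feature one-class outputs for a single image.

    errors: array of shape (n_classes, n_feature_types); entry [c, f] is the
            reconstruction error of the class-c model trained on feature type f.
    Returns normalised class scores and the index of the winning class.
    """
    conf = np.exp(-np.asarray(errors, dtype=float) / scale)  # assumed error-to-confidence mapping
    joint = conf.prod(axis=1)                                  # product over feature types
    scores = joint / joint.sum()                               # normalise across classes
    return scores, int(np.argmax(scores))

# Hypothetical errors for one image: 3 classes x 3 feature types.
errors = [[0.2, 0.3, 0.1],
          [0.6, 0.5, 0.7],
          [0.9, 0.4, 0.8]]
scores, label = product_combine(errors)
print(scores, label)
```

Because the scores are multiplied, a class must look plausible under every feature type to win, which is one way complementary features can correct each other.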
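The threshold sweep described above can be sketched as follows. Equations 23 and 24 are not shown in this section, so the TPr and FPr definitions used here are assumptions: TPr is taken as the fraction of correctly classified images whose confidence margin reaches the threshold, and FPr as the same fraction among misclassified images. The function names, the added (0, 0) and (1, 1) end points, and the toy scores are likewise illustrative.

```python
import numpy as np

def margin_sweep(scores, y_true, start=0.05, stop=0.60, step=0.01):
    """Sweep a threshold on (largest score - second largest score)."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    top2 = np.sort(scores, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    correct = scores.argmax(axis=1) == y_true
    thresholds = start + step * np.arange(int(round((stop - start) / step)) + 1)

    def accepted_fraction(m):
        # Fraction of the given samples whose margin reaches each threshold.
        return np.array([(m >= t).mean() if m.size else 0.0 for t in thresholds])

    tpr = accepted_fraction(margin[correct])    # assumed TPr definition
    fpr = accepted_fraction(margin[~correct])   # assumed FPr definition
    return thresholds, fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the ROC points by the trapezoidal rule, with (0,0) and (1,1) added."""
    x = np.concatenate(([0.0], fpr, [1.0]))
    y = np.concatenate(([0.0], tpr, [1.0]))
    order = np.lexsort((y, x))                  # sort by FPr, then by TPr
    x, y = x[order], y[order]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

# Toy confidence scores for four images and three classes.
toy_scores = [[0.90, 0.05, 0.05],
              [0.80, 0.10, 0.10],
              [0.50, 0.40, 0.10],
              [0.45, 0.40, 0.15]]
toy_labels = [0, 0, 1, 1]
thresholds, fpr, tpr = margin_sweep(toy_scores, toy_labels)
auc = trapezoid_auc(fpr, tpr)
print(len(thresholds), auc)
```

From 0.05 to 0.60 in steps of 0.01 the sweep visits 56 threshold values.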
The proposed method was also compared with some state-of-the-art methods on the biopsy image set. The compared methods are as follows: (i) the level set histogram (LSH) method proposed in [54]; (ii) a cascade classification system (CAS) in [55], which first classifies the images into ‘cancer’ and ‘non-cancer’ categories and then performs a further classification within the ‘cancer’ category to discriminate between different cancer types; (iii) a hybrid feature (HF) method proposed in [56], which used higher-order spectra (HOS), local binary patterns (LBP), and Laws texture energy (LTE) for histopathological image classification, with the Takagi-Sugeno fuzzy model selected as the classifier.
In our experiment, based on the description in [54], for LSH, the images were first converted to grayscale images with an intensity range between 0 and 255; then 25 thresholds in steps of 10 were used to convert the images into binary images (0 and 1). For each binary image, level set segmentation was used to generate a 42-bin histogram for the connected components in the image. Thus, each image finally generated a feature vector of size 42×25=1,050. An SVM with an RBF kernel was used for classification, with the parameter σ, which defines the spread of the radial function, set to 4.0, and the parameter C, which defines the trade-off between the classifier accuracy and the margin, set to 3.0. For CAS, we used the same classifier, decision tree C5.0, and the same image features as stated in [55]. The feature vector for each image is a combination of first-order statistics, co-occurrence matrix features, and steerable filter responses.
Table 7 lists the performance of the compared methods on the biopsy image set, from which it can be noted that the proposed method achieved better performance than the other methods. The CAS method obtained an accuracy of 91.94%, which is superior to the accuracies of LSH and HF. The LSH method obtained only 87.38% accuracy on the biopsy image set. LSH used only the level set histograms for image description, while the other compared methods all used composite image features, which suggests that using a combination of different image features can improve classification performance. Figure 8 presents the ROC curves of the compared methods; the AUCs of the ROC curves are listed in Table 7.
For the 3D OCT retinal images, the method in [53] was compared with the proposed method. The method in [53] used the same image data, with the image features introduced in Section 4.1.2 combined into a single image feature and a Bayes classifier used for classification. The authors reported a classification accuracy of 91.50%, while our proposed system achieved 92.06%.
The proposed method was also compared with some state-of-the-art methods on the UCI breast cancer dataset. The compared methods are the following: (i) the multi-layer perceptron ensemble (MLPE) method proposed in [57]; (ii) a boosted neural network (BoostNN) classifier in [58]; (iii) a decision tree (DT) and support vector machine sequential minimal optimization (SVM-SMO) based ensemble classifier proposed by Luo and Cheng [59]. The results are listed in Table 8.
5 Conclusions
In this paper, a classification scheme based on a one-class KPCA model ensemble has been proposed for the classification of medical images. The ensemble consists of one-class KPCA models trained using different image features from each image class, and a proposed product combining rule was used for combining the kernel PCA models to produce classification confidence scores for assigning an image to each class. The effectiveness of the proposed classification scheme was verified using a breast cancer biopsy image dataset and a 3D OCT retinal image set. The proposed classification scheme obtained high classification accuracy on the tested image sets.
Although the proposed system has shown promising results on the biopsy image classification task, some aspects still need further investigation. The benchmark images used in this work were cropped from the original biopsy scans and cover only the important areas of the scans. However, it is often difficult to find regions of interest (ROIs) that contain the most important tissues in biopsy scans; therefore, more effort needs to be put into detecting ROIs in biopsy images. The parameters of the kernel PCA models, such as the number of principal components and the width of the Gaussian kernel, were fixed during the experiments. In future research, optimization methods or adaptive algorithms should be considered for searching for the optimal parameters of the KPCA models.
References
Boucheron LE: Object- and spatial-level quantitative analysis of multispectral histopathology images for detection and characterization of cancer. Thesis, University of California Santa Barbara, 2008
Loukas C: A survey on histological image analysis-based assessment of three major biological factors influencing radiotherapy: proliferation, hypoxia and vasculature. Comput. Methods Programs Biomed 2004, 74(3):183-199. 10.1016/j.cmpb.2003.07.001
Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg IG: WND-CHARM: multi-purpose image classification using compound image transforms. Pattern Recognit. Lett 2008, 29(11):1684-1693. 10.1016/j.patrec.2008.04.013
Kuncheva L, Rodriguez J, Plumpton C, Linden D, Johnston S: Random subspace ensembles for FMRI classification. IEEE Trans. Med. Imaging 2010, 29(2):531-542.
Tax D: One-class classification. Thesis, Delft University of Technology, 2001
Rokach L: Ensemble-based classifiers. Artif. Intell. Rev 2010, 33: 1-39. 10.1007/s10462-009-9124-7
Goh K-S, Chang EY, Li B: Using one-class and two-class SVMs for multiclass image annotation. IEEE Trans. Knowl. Data Eng 2005, 17(10):1333-1346.
Bergamini C, Oliveira L, Koerich A, Sabourin R: Combining different biometric traits with one-class classification. Signal Process 2009, 89: 2117-2127. 10.1016/j.sigpro.2009.04.043
Haghighi MS, Vahedian A, Yazdi HS: Creating and measuring diversity in multiple classifier systems using support vector data description. Appl. Soft Comput 2011, 11: 4931-4942. 10.1016/j.asoc.2011.06.006
Bryll R, Gutierrez-Osuna R, Quek F: Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognit 2003, 36: 1291-1302. 10.1016/S0031-3203(02)00121-8
Kuncheva L, Jain LC: Designing classifier fusion systems by genetic algorithms. IEEE Trans. Evol. Comput 2000, 4(4):327-336. 10.1109/4235.887233
Zhang L, Zhang L: On combining multiple features for hyperspectral remote sensing image classification. IEEE Trans. Geoscience Remote Sensing 2012, 50(3):879-893.
Yu J, Lin F, Seah H-S, Li C, Lin Z: Image classification by multimodal subspace learning. Pattern Recognit. Lett 2012, 33: 1196-1204. 10.1016/j.patrec.2012.02.002
Moya M, Koch M, Hostetler L: One-class classifier networks for target recognition applications. In Proceedings of World Congress on Neural Networks. Portland; July 1993:797-801.
Khan SS, Madden MG: A survey of recent trends in one class classification. In Artificial Intelligence and Cognitive Science, Lecture Notes in Computer Science, vol. 6206. Edited by: Coyle L, Freyne J. Berlin, Heidelberg: Springer; 2010:188-197.
Markou M, Singh S: Novelty detection: a review-part 1: statistical approaches. Signal Process 2003, 83: 2481-2497. 10.1016/j.sigpro.2003.07.018
Markou M, Singh S: Novelty detection: a review-part 2: neural network based approaches. Signal Processing 2003, 83: 2499-2521. 10.1016/j.sigpro.2003.07.019
Tax DM, Duin RP: Support vector domain description. Pattern Recognit. Lett 1999, 20: 1191-1199. 10.1016/S0167-8655(99)00087-2
Tax DM, Duin RP: Support vector data description. Mach. Learn 2004, 54: 45-66.
Schölkopf B, Platt J, Shawe-Taylor J, Smola A, Williamson RC: Estimating the support of a high dimensional distribution. Neural Comput 2001, 13(7):1443-1472. 10.1162/089976601750264965
Manevitz LM, Yousef M: One-class SVMs for document classification. J. Mach. Learn. Res 2001, 2: 139-154.
Lewis DD: Test collections - Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578. Accessed 22 June 2013
Roth V: Kernel fisher discriminants for outlier detection. Neural Comput 2006, 18: 942-960. 10.1162/neco.2006.18.4.942
Ridder D, Tax D, Duin D: An experimental comparison of one-class classification methods. In Proceedings of the 4th Annual Conference of the Advanced School for Computing and Imaging. Holland: Delft; 1998:213-218.
Wang Q, Lopes L, Tax D: Visual object recognition through one-class learning. In International Conference on Image Analysis and Recognition, Porto, Portugal. Springer, Berlin; 2004:463-470.
Beyer K, Goldstein J, Ramakrishnan R, Shaft U: When is ‘nearest neighbor’ meaningful? Lect. Notes Comput. Sci 1999, 540: 217-235.
Jolliffe IT: Principal Component Analysis. New York: Springer; 1986.
Zhang H, Huang W, Huang Z, Zhang B: A kernel autoassociator approach to pattern classification. IEEE Trans. Syst., Man Cybernetics-Part B: Cybern 2005, 35(3):593-606. 10.1109/TSMCB.2005.843980
Hoffmann H: Kernel PCA for novelty detection. Pattern Recognit 2007, 40: 863-874. 10.1016/j.patcog.2006.07.009
Tax DM, Duin RP: Combining one-class classifiers. In Proceedings of Multiple Classifier Systems. Berlin: Springer; 2001:299-308.
Shieh AD, Kamm DF: Ensembles of one class support vector machines. In Proceedings of the Multiple Classifier Systems. Berlin: Springer; 2009:181-190.
Jain AK, Duin RPW, Mao J: Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell 2000, 22(1):4-37. 10.1109/34.824819
Perdisci R, Gu G: Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2006). Piscataway: IEEE Computer Society; 2006:488-498.
Krawczyk B: Diversity in ensembles for one-class classification. In Advances in Intelligent Systems and Computing, New trends in databases and information systems, vol. 185. Edited by: Pechenizkiy M, Wojciechowski M. Heidelberg: Springer, Berlin; 2013:119-129.
Yang P, Yang YH, Zhou BB, Zomaya AY: A review of ensemble methods in bioinformatics. Curr. Bioinformatics 2010, 5(4):296-308. 10.2174/157489310794072508
Li P, Chan KL, Krishnan SM: Learning a multi-size patch-based hybrid kernel machine ensemble for abnormal region detection in colonoscopic images. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR 2005). Piscataway: IEEE Computer Society; 2005:670-675.
Li P, Chan KL, Fu S, Krishnan SM: An abnormal ecg beat detector approach for long-term monitoring of heart patients based on hybrid kernel machine ensemble. In Proceedings of the International Workshop on Multiple Classifier Systems (MCS 2005). Heidelberg: Springer; 2005:346-355.
Okun O, Priisalu H: Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artif. Intell. Med 2009, 45: 151-162. 10.1016/j.artmed.2008.08.004
Schölkopf B: The kernel trick for distances. Technical report MSR-TR-2000-51, Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052 (2000)
Kallas M, Honeine P, Richard C, Francis C, Amoud H: Non-negativity constraints on the pre-image for pattern recognition with kernel machines. Pattern Recognit 2013, 46: 3066-3080. 10.1016/j.patcog.2013.03.021
Mika S, Schölkopf B, Smola A, Müller K-R, Scholz M, Rätsch G: Kernel PCA and de-noising in feature spaces. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II. Cambridge: MIT Press; 1998:536-542.
Kwok JT-Y, Tsang IW-H: The pre-image problem in kernel methods. IEEE Trans. Neural Netw 2004, 15(6):1517-1525. 10.1109/TNN.2004.837781
Zheng W-S, Lai J, Yuen PC: Penalized preimage learning in kernel principle component analysis. IEEE Trans. Neural Netw 2010, 21(4):551-570.
Williams C: On a connection between kernel PCA and metric multidimensional scaling. In Advances in Neural Information Processing Systems 13, NIPS 2001. Cambridge: MIT Press; 2001:675-681.
Kittler J, Hatef M, Duin RP, Matas J: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell 1998, 20(3):226-239. 10.1109/34.667881
Kuncheva LI: Combining Pattern Classifiers: Methods and Algorithms. New York: Wiley; 2004.
Breast cancer data ftp://ftp.cs.technion.ac.il/pub/projects/medic-image. Accessed 22 June 2013
UCI: Machine learning repository. http://archive.ics.uci.edu/ml/datasets/. Accessed 22 June 2013
Guo Z, Zhang L, Zhang D: A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process 2010, 19(6):1657-1663.
Haralick R, Shanmugam K, Dinstein I: Textural features for image classification. IEEE Trans. Syst., Man Cybern 1973, 3(6):610-621.
Candes E, Demanet L, Donoho D, Ying L: Fast discrete curvelet transforms. Multiscale Model. Simul 2006, 5: 861-899. 10.1137/05064182X
Zhang Y, Zhang B, Coenen F, Lu W: Breast cancer diagnosis from biopsy images with highly reliable random subspace classifier ensembles. Mach. Vis. Appl 2012, 1-17. doi:10.1007/s00138-012-0459-8
Albarrak A, Coenen F, Zheng Y: Age-related macular degeneration identification in volumetric optical coherence tomography using decomposition and local feature extraction. In Proceedings of 2013 International Conference on Medical Image, Understanding and Analysis. University of Birmingham; 17–19 July 2013:59-64.
Brook A, El-Yaniv R, Isler E, Kimmel R, Meir R, Peleg D: Breast cancer diagnosis from biopsy images using generic features and SVMs. Technical report CS-2008-07, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel (2006)
Doyle S, Feldman MD, Shih N, Tomaszewki J, Madabhushi A: Cascaded discrimination of normal, abnormal, and confounder classes in histopathology: Gleason grading of prostate cancer. BMC Bioinformatics 2012, 13(282):1-15.
Krishnan MMR, Venkatraghavan V, Acharya UR, Pal M, Paul RR, Min LC, Ray AK, Chatterjee J, Chakraborty C: Automated oral cancer identification using histopathological images: a hybrid feature extraction paradigm. BMC Bioinformatics 2012, 13(282):1-15.
Valdovinos R, Sanchez J: Performance analysis of classifier ensembles: neural networks versus nearest neighbor rule. Pattern Recognit Image Anal. (Lecture Notes in Computer Science) 2007, 4477: 105-112. 10.1007/978-3-540-72847-4_15
Gou S, Yang H, Jiao L, Zhuang X: Algorithm of partition based network boosting for imbalanced data classification. In Proceedings of the 2010 International Joint Conference on Neural Networks, IJCNN’10. Piscataway: IEEE; 2010:1-6.
Luo S, Cheng B: Diagnosing breast masses in digital mammography using feature selection and ensemble methods. J. Med. Syst 2012, 36(2):569-577. 10.1007/s10916-010-9518-8
Acknowledgements
The project is funded by Natural Science Foundation China grants 61262070 and EIN2011A001 and China Yunnan Provincial Natural Science Foundation grant 2010CD047.
Competing interests
The authors declare that they have no competing interests.
An erratum to this article can be found at http://dx.doi.org/10.1186/s13634-015-0274-2.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Zhang, Y., Zhang, B., Coenen, F. et al. One-class kernel subspace ensemble for medical image classification. EURASIP J. Adv. Signal Process. 2014, 17 (2014). https://doi.org/10.1186/1687-6180-2014-17
DOI: https://doi.org/10.1186/1687-6180-2014-17