One-class kernel subspace ensemble for medical image classification
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 17 (2014)
 The Erratum to this article has been published in EURASIP Journal on Advances in Signal Processing 2015 2015:88
Abstract
Classification of medical images is an important issue in computer-assisted diagnosis. In this paper, a classification scheme based on an ensemble of one-class kernel principal component analysis (KPCA) models is proposed for the classification of medical images. The ensemble consists of one-class KPCA models trained using different image features from each image class, and a proposed product combining rule is used to combine the KPCA models and produce classification confidence scores for assigning an image to each class. The effectiveness of the proposed classification scheme was verified using a breast cancer biopsy image dataset and a 3D optical coherence tomography (OCT) retinal image set. The combination of different image features exploits the complementary strengths of the different feature extractors. The proposed classification scheme obtained promising results on the two medical image sets. The proposed method was also evaluated on the UCI breast cancer dataset (diagnostic), and a competitive result was obtained.
1 Introduction
Medical imaging is one of the most important tools in modern medicine; different types of imaging technologies such as X-ray imaging, ultrasonography, biopsy imaging, computed tomography, and optical coherence tomography have been widely used in clinical diagnosis for various kinds of diseases. However, in clinical applications, it is usually time-consuming to examine an image manually. Moreover, as there is always a subjective element in the pathological examination of an image by a human physician, an automated technique can provide valuable assistance for physicians. A large focus of medical image analysis has been automated image classification. Many recent studies have revealed that medical images can be properly classified if suitable image feature descriptions are chosen [1–3]. These studies demonstrated that by combining different image description features, it is possible to improve medical image classification performance.
Although classifiers that support multi-class classification, such as support vector machines (SVM) and neural networks, are usually selected for medical image classification [4], one-class classifiers (OCC) [5], which can work with only the samples seen so far, are more appropriate for the medical image classification task. One-class classification is also often called outlier (or novelty) detection, as the learning algorithms are used to differentiate between data that appear normal and abnormal with respect to the distribution of the training data. This principle makes one-class classification appropriate for medical diagnosis and for disease versus no-disease problems.
In many real classification tasks, using a single classifier often fails to capture all aspects of the data. Therefore, a combination of classifiers (an ensemble) is often considered to be an appropriate mechanism to address this shortcoming. The main idea behind the ensemble methodology is to use several classifiers and combine the individual results in order to produce a classification that outperforms the outcome that would have been produced if the classifiers were to operate in isolation [6]. Ensembles of one-class classifiers have also been shown to perform better than individual classifiers [7–9].
There are many strategies for constructing a classifier ensemble, with examples including the use of different training data sets, different feature subsets, various types of individual classifiers, and different fusion rules. Among these, the feature subset strategy has shown better performance when the dimensionality of the feature vector is high compared to the number of data samples [10–13]. The feature subset ensemble strategy is therefore well suited to medical image classification problems: various types of image features are generally extracted for such tasks, so the dimensionality of the feature vector typically exceeds the number of image samples and the ‘curse of dimensionality’ occurs. Training each base classifier on a feature subset can avoid this problem.
In this paper, we propose and evaluate a novel classification scheme for medical images. The proposed classification scheme utilizes an ensemble of one-class classifiers, which is built with the feature subset strategy; each one-class classifier is trained with one type of feature extracted from the training images. The kernel principal component analysis (KPCA) model was chosen as the base classifier of the ensemble. Given an m-class classification task and n different kinds of image features, the ensemble will consist of m×n KPCA models. For an unlabeled image, its n types of features will first be mapped into the kernel space by the corresponding n trained KPCA models from each class. The mapped features will then be reconstructed from the high-dimensional kernel space into the original space by pre-image learning, and the distances between the original features and the reconstructed features will be measured. The distances given by the KPCA models will be combined to output a confidence score describing the probability of the sample belonging to a class. For an m-class classification task, m confidence scores will be obtained, one for each class. The image will be classified into the class with the maximum confidence score. Promising classification performance was obtained using the proposed classification scheme on two medical image sets.
2 Related works
In this section, we will first introduce some related works on one-class classification. Then one-class classifier ensembles will be discussed.
2.1 One-class classification
Moya et al. originated the term one-class classification [14]. Many approaches to one-class classification have been presented in the literature [5]. Following the taxonomy in the survey papers of [15–17], the algorithms used in OCC can be categorized as follows: (i) boundary methods, (ii) density estimation, and (iii) reconstruction methods.
Tax and Duin [18, 19] sought to solve the problem of OCC by distinguishing the positive class from all other patterns in the pattern space; the positive class data was surrounded by a hypersphere which encompassed almost all positive patterns within the minimum radius. This method of support vector data description (SVDD) differs from that proposed by Schölkopf et al. [20], who used a separating hyperplane instead of a hypersphere to separate the region of the pattern space containing data from the region containing no data. Manevitz and Yousef [21] proposed another version of one-class SVM based on identifying outlier data as representative of the second class; they applied their method to the standard Reuters [22] dataset and noted that their SVM method was quite sensitive to the choice of representation and kernel. Although one-class classifiers, such as OCSVM, have been widely used, the estimated boundary can be sensitive to the nature of the data [23]. This can be highly problematic for many applications, especially for medical diagnosis, where the number of false negatives must be kept to a minimum, since an accidental diagnosis of a cancer patient as healthy may result in death.
Density estimation methods estimate the density of the target class to form a model with which to represent the data. The generally used models include Parzen, Gaussian, and Gaussian mixture models. The test point is classified by the maximum posterior probability. Generally, this approach works well when the sample size is sufficiently high and a flexible model is used. However, when the model does not fit the data very well, a large bias may result. Details and some comparisons of these methods can be found in [24, 25].
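As an illustration of the density estimation approach, the following minimal sketch estimates a Parzen-window density for the target class and accepts a test point when its density reaches a threshold. The Gaussian window, the width `h`, the `threshold` value, and the function names are assumptions made for this example and are not taken from the cited works.

```python
import math

def parzen_density(x, train, h):
    """Parzen-window density estimate at x using an isotropic Gaussian window of width h."""
    d = len(x)
    norm = (2.0 * math.pi * h * h) ** (d / 2.0)  # Gaussian normalising constant
    total = 0.0
    for t in train:
        sq = sum((a - b) ** 2 for a, b in zip(x, t))  # squared Euclidean distance
        total += math.exp(-sq / (2.0 * h * h))
    return total / (len(train) * norm)

def is_target(x, train, h, threshold):
    """Accept x as a target-class object if its estimated density reaches the threshold."""
    return parzen_density(x, train, h) >= threshold

# Hypothetical 2-D target-class samples clustered around the origin.
train = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1), (0.05, 0.2)]
near = parzen_density((0.0, 0.1), train, h=0.3)
far = parzen_density((3.0, 3.0), train, h=0.3)
```

With few training samples the estimate depends strongly on `h`, which is in line with the remark above that the approach works best when the sample size is sufficiently high.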
As the density estimation or support-vector-based methods require large training sets, when this is not feasible, one can approximate the target class by a simpler reconstruction model. This type of model tries to capture the data structure; new objects are projected onto this model. The reconstruction error, the difference between the original object x and the projected object p(x), indicates the resemblance of a new object to the original target distribution. When the training data has a very high dimensionality, the nearest neighbor methods tend to perform badly [26]. In such cases, it can often be assumed that the target data is distributed in subspaces of much lower dimensionality. Principal component analysis [27] is a linear model that projects the original data into an orthogonal space which captures the variance in the data. Many nonlinear subspace models have also been proposed, such as the self-organizing map (SOM), auto-encoders, auto-associative networks [28], and kernel PCA [29].
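For the linear case, the reconstruction error can be written out explicitly. The following is a sketch under assumed notation that does not appear in the text: μ denotes the mean of the training data, W the matrix whose M orthonormal columns are the leading principal directions, and ε(x) the reconstruction error.

```latex
\[
p(x) = \mu + W W^{\prime}(x - \mu),
\qquad
\varepsilon(x) = \lVert x - p(x) \rVert^{2}
\]
```

A small ε(x) suggests that x lies close to the subspace learned from the target class, whereas a large ε(x) suggests an outlier.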
2.2 Ensemble of one-class classifiers
Ensemble learning is concerned with mechanisms for combining the results of a number of weak learning systems to produce better learning performance. Several methodologies exist for creating an ensemble classifier from individual classifiers; a survey on the design of multiple classifier systems can be found in [6]. It has been demonstrated that combining classifiers can also be effective for one-class classifiers, and the existing classifier combination strategies can be applied to them. However, because only information concerning one class is available to a one-class classifier, combining one-class classifiers is more difficult. Tax and Duin investigated the influence of feature sets and of the types of one-class classifiers on the best choice of combination rule [30]. A bagging-based one-class support vector machine ensemble method was proposed in [31]. A dynamic ensemble strategy based on structural risk minimization [32] was proposed by Goh et al. for multi-class image annotation [7]. Recently, some research results have revealed that creating a one-class classifier ensemble from different feature subsets can provide better performance. Perdisci et al. [33] used an ensemble of one-class SVMs to create a ‘high-speed payload-based’ anomaly detection system, in which the features were first extracted and clustered and the OCSVM ensemble was then constructed from the clustered feature subsets. A biometric classification system combining different biometric features was proposed by Bergamini et al. [8], where the one-class SVMs in the ensemble were trained with data from different people. The feature subset strategy provides diversity with respect to the base classifiers, and some researchers emphasize the importance of measuring diversity in ensembles so as to improve classification performance [9, 34].
Combining one-class classifiers has also shown promising performance in medicine and biology [35]. Peng Li et al. [36] proposed a multi-size patch-based classifier ensemble, which provides a multiple-level representation of image content; this method was evaluated on colonoscopy images and ECG beat detection [37]. The k-nearest neighbor classifier was selected as the base classifier in the work of Okun and Priisalu [38], in which majority voting was chosen as the combination rule for the ensemble and the method was evaluated on gene expression cancer data.
3 One-class kernel subspace ensemble
In this section, the one-class kernel PCA model ensemble will be introduced. The theory of kernel PCA and pattern reconstruction via pre-image will first be introduced; then the proposed KPCA ensemble will be described.
3.1 KPCA and pattern reconstruction via pre-image
The traditional (linear) PCA tries to preserve the greatest variations of the data by approximating the data in a principal component subspace spanned by the leading eigenvectors; noise and less important data variations are removed. Kernel PCA inherits this scheme; however, it performs linear PCA in the kernel feature space ${\mathbb{H}}_{\kappa}$. Suppose $\mathbb{X}\subset {\Re}^{n}$ is the original input data space and ${\mathbb{H}}_{\kappa}$ is a reproducing kernel Hilbert space (RKHS) (also called feature space) associated with a kernel function κ(x,y)=<φ(x),φ(y)>, where $x,y\in \mathbb{X}$, and $\phi(\cdot)$ is the mapping induced by κ such that $\phi:\mathbb{X}\to {\mathbb{H}}_{\kappa}$. Given a set of patterns $\{{x}_{1},{x}_{2},\dots ,{x}_{N}\}\in \mathbb{X}$, kernel PCA performs the traditional linear PCA in ${\mathbb{H}}_{\kappa}$. Similar to linear PCA, KPCA also has the eigendecomposition:
$$\mathbf{H}\mathbf{K}\mathbf{H}=\mathbf{U}\mathbf{\Lambda}{\mathbf{U}}^{\prime}\qquad (1)$$
where K is the kernel matrix such that K_{ ij }=κ(x_{ i },x_{ j }), and
$$\mathbf{H}=\mathbf{I}-\frac{1}{N}\mathbf{1}{\mathbf{1}}^{\prime}\qquad (2)$$
is the centering matrix, where I is the N×N identity matrix, 1=[1,1,…,1]^{′} is an N×1 vector, U=[α_{1},…,α_{ N }] is the matrix containing eigenvectors α_{ i }=[α_{i 1},…,α_{ iN }]^{′}, and Λ=diag(λ_{1},…,λ_{ N }) contains the corresponding eigenvalues.
Denote the mean of the φ-mapped patterns by $\bar{\phi}=\frac{1}{N}\sum _{j=1}^{N}\phi \left({x}_{j}\right)$. Then for a mapped pattern φ(x_{ i }), the centered map $\tilde{\phi}\left({x}_{i}\right)$ can be defined as follows:
$$\tilde{\phi}\left({x}_{i}\right)=\phi \left({x}_{i}\right)-\bar{\phi}\qquad (3)$$
The kth eigenvector V_{ k } of the covariance matrix in the feature space is a linear combination of $\tilde{\phi}\left({x}_{i}\right)$:
where $\tilde{\phi}=\left[\tilde{\phi}({x}_{1}),\tilde{\phi}({x}_{2}),\dots ,\tilde{\phi}({x}_{N})\right]$. If we use β_{ k } to denote the projection of the φ-image of a pattern x onto the kth component V_{ k }, then:
where
where k_{ x }=[κ(x,x_{1}),…,κ(x,x_{ N })]^{′}. Denote
then β_{ k } in Equation 5 can be rewritten as ${\beta}_{k}={\mathit{\alpha}}_{k}^{{}^{\prime}}{\stackrel{~}{\kappa}}_{x}$.
Therefore, the projection P(φ(x)) of φ(x) onto the subspace spanned by the first M eigenvectors can be obtained by:
where $\mathbf{L}=\sum _{k=1}^{M}{\mathit{\alpha}}_{k}{\mathit{\alpha}}_{k}^{\prime}$.
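For reference, the chain from the eigenvectors to the projection can be collected in one place. The block below is a reconstruction from the definitions in this section and should not be read as a verbatim copy of Equations 4 to 8. It assumes that each α_k is scaled so that V_k has unit norm, which is the convention under which β_k = α_k′κ̃_x and L = Σ α_k α_k′ hold as stated; other presentations of kernel PCA carry explicit 1/√λ_k factors instead.

```latex
\begin{align*}
V_k &= \sum_{i=1}^{N} \alpha_{ki}\,\tilde{\phi}(x_i) = \tilde{\phi}\,\alpha_k, \\
\beta_k &= \tilde{\phi}(x)^{\prime} V_k
         = \sum_{i=1}^{N} \alpha_{ki}\,\tilde{\kappa}(x, x_i)
         = \alpha_k^{\prime}\,\tilde{\kappa}_x, \\
\tilde{\kappa}(x, x_i) &= \tilde{\phi}(x)^{\prime}\tilde{\phi}(x_i)
         = \kappa(x, x_i) - \tfrac{1}{N}\mathbf{1}^{\prime} k_x
           - \tfrac{1}{N}\mathbf{1}^{\prime} k_{x_i}
           + \tfrac{1}{N^{2}}\mathbf{1}^{\prime}\mathbf{K}\mathbf{1}, \\
\tilde{\kappa}_x &= [\tilde{\kappa}(x, x_1), \dots, \tilde{\kappa}(x, x_N)]^{\prime}
         = \mathbf{H}\Bigl(k_x - \tfrac{1}{N}\mathbf{K}\mathbf{1}\Bigr), \\
P(\phi(x)) &= \sum_{k=1}^{M} \beta_k V_k + \bar{\phi}
         = \tilde{\phi}\,\mathbf{L}\,\tilde{\kappa}_x + \bar{\phi}.
\end{align*}
```

The mean is added back in the last line because the coefficients β_k are computed from the centered map of x.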
PCA is a simple method whereby a model for the distribution of training data can be generated. For linear distributions, PCA can be used; however, many real-world problems are nonlinear. Methods like Gaussian mixture models and auto-associative neural networks have been used for nonlinear problems. These methods, however, need to solve a nonlinear optimization problem and are thus prone to local minima and sensitive to the initialization [29]. KPCA runs PCA in the high-dimensional feature space through the nonlinearity of the kernel, and this allows for a refinement in the description of the patterns of interest. Therefore, kernel PCA was chosen to model the nonlinear distribution of the training samples here.
Kernel PCA has been widely used for classification tasks. A straightforward method using kernel PCA for classification is to directly use the distances between the mapped patterns in the feature space ${\mathbb{H}}_{\kappa}$ to obtain the classification boundaries [29, 39]. However, the experimental results in [29] showed that the classification performance depends heavily on the parameters selected for the kernel function, and there is no guideline for parameter selection in real classification tasks. A more recent work also demonstrated that kernel-space distances are not sufficient for unsupervised learning algorithms and that distances in the input space are more appropriate for classification [40].
In this paper, we focus on the distances between a pattern x and its reconstructions by the kernel PCA models trained from different classes. Kernel PCA is used as a one-class classifier here, which means that for each class, at least one KPCA model is trained. For an m-class classification task, there will be m KPCA models, one for each class. Given an unlabeled pattern x, every KPCA model will produce a projection P(φ(x))_{ i }, i=1,…,m. During classification, x will be reconstructed in the input space from every P(φ(x))_{ i }, so that m reconstruction results ${x}_{1}^{{}^{\prime}},\dots ,{x}_{m}^{{}^{\prime}}$ are obtained. The distance between x and each ${x}_{i}^{{}^{\prime}}$ (also called the reconstruction error) is calculated, and x will be assigned to the class whose KPCA model produces the minimum reconstruction error. Ideally, the KPCA model trained from the class to which x belongs will always give the minimum reconstruction error. In our proposed classification scheme, multiple KPCA models are trained for each class and the reconstruction errors of KPCA models from different classes are combined for classification, as described in Sections 3.2 and 3.3.
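The decision rule described above can be sketched in a few lines. In this minimal example the per-class reconstructions are supplied directly as hypothetical values; in the actual scheme they would come from the class-specific KPCA models followed by pre-image learning. The function name is an assumption made for illustration.

```python
def classify_by_reconstruction_error(x, reconstructions):
    """Return the index of the class whose reconstruction is closest to x,
    together with the squared reconstruction errors for all classes."""
    errors = [sum((a - b) ** 2 for a, b in zip(x, r)) for r in reconstructions]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors

# Hypothetical pattern and its reconstructions by m = 3 class-specific models.
x = (1.0, 2.0)
reconstructions = [(1.1, 2.1), (3.0, 0.0), (0.0, 0.0)]
label, errors = classify_by_reconstruction_error(x, reconstructions)
```

Here the first model reproduces x almost exactly, so the pattern is assigned to the first class.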
In order to obtain the input-space distance between x and its reconstruction result, it is necessary to map P(φ(x)) back into the input space. The reverse mapping from feature space back to input space is called the pre-image problem (Figure 1). However, the pre-image problem is ill-posed, and the exact pre-image x^{′} of P(φ(x)) in the input space generally does not exist [41]; instead, one can only find an approximation $\widehat{x}$ in the input space such that
$$\phi (\widehat{x})\simeq P(\phi (x))\qquad (9)$$
Several algorithms have been proposed to address the pre-image learning problem. Mika et al. [41] proposed an iterative method that determines the pre-image by minimizing a least-squares distance error. Kwok and Tsang proposed a distance constraint learning (DCL) method that finds the pre-image using a technique similar to multi-dimensional scaling (MDS) [42]. In a more recent work, Zheng et al. [43] proposed a weakly supervised penalty strategy for pre-image learning in KPCA; however, their method needs information for both positive and negative classes. As we are only interested in one-class scenarios, the distance constraint method in [42] was selected for the work described in this paper. We briefly review the method here.
For any two patterns x_{ i } and x_{ j } in the input space, the Euclidean distance d(x_{ i },x_{ j }) can be easily obtained. Similarly, the feature-space distance $\tilde{d}\left(\phi ({x}_{i}),\phi ({x}_{j})\right)$ between their φ-mapped images in the feature space can also be obtained. For many commonly used kernels, such as the Gaussian kernel, there is a simple relationship between the feature-space distance and the input-space distance [44]. Writing ${d}_{ij}=d({x}_{i},{x}_{j})$ and ${\tilde{d}}_{ij}=\tilde{d}\left(\phi ({x}_{i}),\phi ({x}_{j})\right)$, for an isotropic kernel of the form $\kappa ({x}_{i},{x}_{j})=\kappa ({d}_{ij}^{2})$:
$${\tilde{d}}_{ij}^{2}={K}_{ii}+{K}_{jj}-2\kappa ({d}_{ij}^{2})\qquad (10)$$
Therefore,
$$\kappa ({d}_{ij}^{2})=\frac{1}{2}\left({K}_{ii}+{K}_{jj}-{\tilde{d}}_{ij}^{2}\right)\qquad (11)$$
As κ is invertible, ${d}_{ij}^{2}$ can be obtained if ${\tilde{d}}_{ij}^{2}$ is known.
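The relationship between the two distances can be checked numerically for the Gaussian kernel. The sketch below assumes the parameterization κ(x,y)=exp(−‖x−y‖²/c), for which K_ii=1, the squared feature-space distance is 2−2exp(−d²/c), and the inversion is d²=−c·ln(1−d̃²/2). The kernel width c and the function names are assumptions made for this example.

```python
import math

def gaussian_kernel(x, y, c):
    """Gaussian kernel exp(-||x - y||^2 / c) with an assumed width parameter c."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / c)

def feature_space_sq_dist(x, y, c):
    """Squared feature-space distance obtained from kernel evaluations only."""
    return gaussian_kernel(x, x, c) + gaussian_kernel(y, y, c) - 2.0 * gaussian_kernel(x, y, c)

def input_sq_dist_from_feature(d_tilde_sq, c):
    """Invert the Gaussian kernel to recover the squared input-space distance."""
    return -c * math.log(1.0 - 0.5 * d_tilde_sq)

x, y, c = (0.0, 1.0), (2.0, -1.0), 4.0
d_tilde_sq = feature_space_sq_dist(x, y, c)
recovered = input_sq_dist_from_feature(d_tilde_sq, c)  # should match ||x - y||^2
```

For the Gaussian kernel the squared feature-space distance is bounded by 2, so the logarithm is well defined for any pair of distinct finite points.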
A given training set has n patterns X={x_{1},…,x_{ n }}. For a pattern x in the input space, the corresponding φ(x) is projected to P(φ(x)); then, for each training pattern x_{ i } in X, P(φ(x)) will be at a certain distance $\tilde{d}\left(P(\phi (x)),\phi ({x}_{i})\right)$ from φ(x_{ i }) in the feature space. This feature-space distance can be obtained by:
$${\tilde{d}}^{2}\left(P(\phi (x)),\phi ({x}_{i})\right)={\Vert P(\phi (x))\Vert}^{2}+{\Vert \phi ({x}_{i})\Vert}^{2}-2P{(\phi (x))}^{\prime}\phi ({x}_{i})\qquad (12)$$
Equation 12 can be evaluated using Equations 5 and 8. Therefore, the kernel-space distances in Equation 11 between P(φ(x)) and each x_{ i } can now be obtained. Denote the kernel-space distance between P(φ(x)) and x_{ i } as:
The location of $\widehat{x}$ will be obtained by requiring ${d}^{2}(\widehat{x},{x}_{i})$ to be as close to the values in (13) as possible, i.e.,
To this end, in DCL, the training set X is constrained to the n nearest neighbors of x, and the least square optimization is used to obtain $\widehat{x}$.
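The distance constraints can be summarized as follows. This is a reconstruction under assumed notation rather than a verbatim copy of Equations 13 and 14: d_i² denotes the squared input-space distance recovered through Equation 11 from the feature-space distance between P(φ(x)) and φ(x_i), and the least-squares objective in the last line is one generic way to state the fitting problem, not necessarily the exact procedure of [42].

```latex
\begin{align*}
\mathbf{d}^{2} &= [\,d_1^{2},\; d_2^{2},\; \dots,\; d_n^{2}\,]^{\prime}, \\
d^{2}(\widehat{x}, x_i) &\simeq d_i^{2}, \qquad i = 1, \dots, n, \\
\widehat{x} &= \arg\min_{z} \sum_{i=1}^{n}
    \bigl( \lVert z - x_i \rVert^{2} - d_i^{2} \bigr)^{2}.
\end{align*}
```

Restricting the constraints to the nearest neighbors keeps the problem small and emphasizes local structure.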
3.2 Construction of one-class KPCA ensemble for image classification
Given an image set of m classes, the proposed one-class KPCA ensemble is built as follows: (i) for each image category, n types of image features are extracted; (ii) a KPCA model is trained for each individual type of extracted feature; and (iii) therefore, for each image class, n KPCA models are constructed. For an m-class problem, there will be m×n KPCA models in the ensemble. The construction of the proposed one-class KPCA ensemble is illustrated in Figure 2, where ${\text{KPCA}}_{i}^{j}$ represents the model trained by the type j feature from class i.
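The construction steps above can be sketched as a loop over classes and feature types. In this minimal example `train_kpca` is a hypothetical placeholder that only records the shape of its training data; it stands in for fitting a real one-class KPCA model, and the data layout is likewise an assumption.

```python
def train_kpca(features):
    """Hypothetical placeholder for fitting a one-class KPCA model.
    It only records the number of training samples and the feature dimension."""
    return {"n_samples": len(features), "dim": len(features[0])}

def build_ensemble(training_features):
    """Build one model per (class, feature type) pair.
    training_features[i][j] is the list of type-j feature vectors from class i."""
    ensemble = {}
    for i, per_class in enumerate(training_features):
        for j, feats in enumerate(per_class):
            ensemble[(i, j)] = train_kpca(feats)
    return ensemble

# Hypothetical toy data: m = 3 classes, n = 2 feature types, 4 images per class.
m, n = 3, 2
training_features = [
    [[[0.0] * (j + 2) for _ in range(4)] for j in range(n)]
    for _ in range(m)
]
ensemble = build_ensemble(training_features)
```

Keying the models by (class, feature type) mirrors the notation used for the KPCA models in Figure 2.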
3.3 Multi-class prediction using an ensemble of one-class KPCA models
The classification confidence score describes the probability that an image belongs to each class, and it provides a quantitative measure of the predictions produced by the KPCA models.
Given an unlabeled image x with n extracted features F={f_{1},f_{2},…,f_{ n }}, let ${\text{KPCA}}_{i}^{j}$ represent the KPCA model belonging to class i and trained from the feature set f_{ j }, where i∈{1…m} is the class label and j∈{1…n} is the feature label. For classification, the pre-images of each image feature f_{ j }∈F will be obtained by all the KPCA models trained from the jth feature. The DCL method introduced in Section 3.1 is used for obtaining the pre-images. For example, the pre-images of f_{1} will be obtained by the models ${\text{KPCA}}_{i}^{1},i=1,\dots ,m$. Denote the pre-images of f_{ j } as ${f}_{j}^{\prime}=\{{f}_{j}^{\prime 1},{f}_{j}^{\prime 2},\dots ,{f}_{j}^{\prime m}\}$; the squared distance D_{ j } between f_{ j } and ${f}_{j}^{\prime}$ is used as the reconstruction error, therefore:
$${D}_{j}=\left[{d}_{1}^{j},{d}_{2}^{j},\dots ,{d}_{m}^{j}\right]\qquad (15)$$
where ${d}_{i}^{j}={\Vert {f}_{j}-{f}_{j}^{\prime i}\Vert}^{2},i=1,\dots ,m$. In the same way, the pre-images of all the features in F will be obtained, forming a distance matrix D of dimensions n×m, where n is the number of KPCA models used for the pre-image learning and m is the number of image classes. Each row of D represents the reconstruction errors of a feature in F by m KPCA models from each class:
$$\mathbf{D}=\left[\begin{array}{c}{D}_{1}\\ {D}_{2}\\ \vdots \\ {D}_{n}\end{array}\right]=\left[\begin{array}{cccc}{d}_{1}^{1}& {d}_{2}^{1}& \cdots & {d}_{m}^{1}\\ {d}_{1}^{2}& {d}_{2}^{2}& \cdots & {d}_{m}^{2}\\ \vdots & \vdots & \ddots & \vdots \\ {d}_{1}^{n}& {d}_{2}^{n}& \cdots & {d}_{m}^{n}\end{array}\right]\qquad (16)$$
Note that the values in each column of D represent the reconstruction errors of F using the KPCA models from the same class; these values measure how well an image x is described by the KPCA models from one class. Since we look for the class whose KPCA models give the minimum reconstruction error, this is in effect a 1-nearest neighbor search, as we wish to find the best pre-image of x among m pre-images. Such a distance measure can improve the speed of the classification. Moreover, it is also in line with the ideas in metric multi-dimensional scaling, in which smaller dissimilarities are given more weight, and in locally linear embedding, where only the local neighborhood structure needs to be preserved [42].
In order to combine the reconstruction errors from KPCA models, the reconstruction errors in D are first normalized using Equation 17:
which models a Gaussian distribution from the square distance. The scale parameter s can be fitted to the distribution of ${d}_{i}^{j}$. Moreover, Equation 17 has the feature that the scaled value is always bounded between 0 and 1. The normalized distance matrix D is denoted by $\stackrel{~}{D}$.
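A minimal sketch of this normalization is given below. It assumes the mapping takes the exponential form exp(−d/s) applied to each squared distance, which matches the stated properties (Gaussian in the distance, bounded between 0 and 1) but is an assumption about the exact form of Equation 17. The function name and the value of s are also hypothetical.

```python
import math

def normalize_errors(D, s):
    """Map each squared reconstruction error d to a resemblance exp(-d / s)."""
    return [[math.exp(-d / s) for d in row] for row in D]

# Hypothetical n x m matrix of squared reconstruction errors (n = 2 features, m = 3 classes).
D = [[0.0, 1.5, 4.0],
     [0.2, 0.9, 3.0]]
D_tilde = normalize_errors(D, s=2.0)
```

A zero reconstruction error maps to 1, and larger errors map monotonically toward 0, so larger normalized values indicate a better match.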
The normalized reconstruction errors in $\tilde{D}$ are obtained by different one-class KPCA models and can be combined to produce the confidence scores (CS) for classifying x into each class. Let CS={cs_{1},cs_{2},…,cs_{ m }} denote the confidence scores for x with respect to each image class. The confidence scores can be computed from the distance matrix $\tilde{D}$ by using an appropriate combination rule. A product rule was proposed in [45] for combining one-class classifiers:
where k is the label of the target class. $\prod _{k}{P}_{k}(x|{w}_{\mathrm{T}})$ is the probability of classifying x into the target class obtained from the classifiers of class k, which can be calculated from the values in one column of the distance matrix $\tilde{D}$ as:
$\prod _{k}{P}_{k}(x|{w}_{\mathrm{O}})$ represents the probability of x belonging to the outlier class, which is obtained by multiplying all the values in $\tilde{D}$ except the values from the ‘target’ class k:
In [30], the authors investigated different mechanisms for combining one-class classifiers, and their results showed that the ‘product rule’ in Equation 18 outperforms other combining mechanisms for one-class classifiers. As noted in [30, 45], when using the product combining rule, ${P}_{k}(x|{w}_{\mathrm{T}})$ should be available and a distance should be transformed to a ‘resemblance’ by some heuristic mapping as in Equation 17.
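The product rule can be sketched on a normalized distance matrix as follows. The sketch assumes that the confidence score for target class k takes the form T/(T+O), where T is the product of the values in column k and O is the product of all values in the other columns. This form is an assumption consistent with the description above, not a verbatim copy of Equations 18 to 20, and the function name is hypothetical.

```python
from math import prod

def product_rule_scores(D_tilde):
    """Confidence score per class under the assumed form T / (T + O).
    D_tilde is an n x m matrix: rows are feature types, columns are classes."""
    n, m = len(D_tilde), len(D_tilde[0])
    scores = []
    for k in range(m):
        # Target term: product over all feature types of the class-k values.
        target = prod(D_tilde[j][k] for j in range(n))
        # Outlier term: product over every value outside column k.
        outlier = prod(D_tilde[j][i] for j in range(n) for i in range(m) if i != k)
        scores.append(target / (target + outlier))
    return scores

# Hypothetical normalized values for n = 2 feature types and m = 2 classes.
D_tilde = [[0.9, 0.2],
           [0.8, 0.3]]
scores = product_rule_scores(D_tilde)
```

With only two classes the target and outlier terms have the same number of factors, so the two scores sum to one; this balance is lost as the number of classes grows.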
However, when one-class classifiers are used for multi-class classification tasks, the product rule in Equation 18 may not perform well. The number of one-class classifiers constructed for the outlier classes will exceed the number of classifiers for the target class; a problem of ‘imbalance’ thus occurs in Equation 18, where far more items are used for producing $\prod _{k}{P}_{k}(x|{w}_{\mathrm{O}})$ than for $\prod _{k}{P}_{k}(x|{w}_{\mathrm{T}})$. During classification, some classifiers from the outlier classes may give small classification probabilities when the classifiers estimate that the pattern is not an outlier. In Equation 18, these small probabilities will still be used to calculate $\prod _{k}{P}_{k}(x|{w}_{\mathrm{O}})$, even if there are more classifiers which have a different judgement. In this imbalanced situation, due to those relatively small probabilities, a small value of $\prod _{k}{P}_{k}(x|{w}_{\mathrm{O}})$, approaching 0, will be obtained, which makes the classification confidence scores rather close to each other.
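The effect of the imbalance can be seen in a small numerical illustration. The sketch assumes the score form T/(T+O) and a hypothetical situation in which every model in a four-class, four-feature ensemble returns the same uninformative resemblance p; all names and values are assumptions made for this example.

```python
m, n = 4, 4                      # four classes, four feature types (as in Figure 3)
p = 0.5                          # hypothetical resemblance returned by every model

target_terms = n                 # factors in the target product
outlier_terms = (m - 1) * n      # factors in the outlier product under the original rule

target = p ** target_terms
outlier_all = p ** outlier_terms
score_imbalanced = target / (target + outlier_all)

# If the outlier product had as many factors as the target product,
# an uninformative ensemble would give a neutral score instead.
outlier_balanced = p ** n
score_balanced = target / (target + outlier_balanced)
```

Although no model favors any class, the imbalanced score is close to 1 for every candidate target class, which is why the scores become hard to tell apart.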
Here, a variant of the product combining rule of Equation 18 is proposed to address the imbalance problem. Instead of using the mapping values from all the outlier classes’ KPCA models, for those models trained by the same type of image feature, only the model that gives the largest mapping value will be chosen to produce $\prod _{k}{P}_{k}(x|{w}_{\mathrm{O}})$. The proposed product combining rule can be described as:
where j is the image feature label and j=1…n. $\prod _{k}{P}_{k}(x|{w}_{\mathrm{T}})$ can be obtained using Equation 19. Each ${P}_{k}^{j}(x|{w}_{\mathrm{O}})$ in $\prod _{j}{P}_{k}^{j}(x|{w}_{\mathrm{O}})$ is the probability that x belongs to the outlier classes using the jth image feature, which can be obtained by:
The maximum value selection procedure in Equation 21 is illustrated by a simple example in Figure 3. In Figure 3, there is a four-class classification task (I, II, III, and IV in the figure), in which four types of features are extracted from image x. For one type of image feature, there are four trained KPCA models, each from a different class, giving four reconstruction results for the same feature of x (one row in matrix $\tilde{D}$). If we consider class I as the ‘target’ class (first column in the figure), the four values in the first column are used to produce the item $\prod _{k}{P}_{k}(x|{w}_{\mathrm{T}})$ in Equation 21. The other three columns of values are deemed the outlier probabilities produced by the KPCA models from the other three classes. The proposed combining rule selects the maximum mapping value among the outlier columns of each row to produce the outlier probability product $\prod _{j}{P}_{k}^{j}(x|{w}_{\mathrm{O}})$.
The selection scheme in Equation 21 ensures that the numbers of items for calculating $\prod _{k}{P}_{k}(x|{w}_{\mathrm{O}})$ and $\prod _{k}{P}_{k}(x|{w}_{\mathrm{T}})$ are the same. Moreover, the negative effect of the imbalance on the confidence scores is removed. The proposed combining rule is in line with the basic idea of one-class classification, as in the one-class scenario, one only needs to know whether a pattern should be assigned to the target class or to the outlier class. If one or more outlier models are able to produce a high outlier probability product, the current target class should be doubted. Moreover, by combining the outlier values from different feature-derived models, the diversity of the ensemble is improved, which is an important factor in making an ensemble learning method successful [46].
Note that since the ‘target class’ is unknown for an unlabeled image, during classification each class will be deemed the target class in turn to calculate the confidence score, i.e., each column in $\tilde{D}$ will be used in turn to obtain $\prod _{k}{P}_{k}(x|{w}_{\mathrm{T}})$ for each class. In this way, for an m-class classification task, each class is deemed the target class, one by one, to produce m confidence scores; the image is then assigned to the class giving the maximum classification confidence score.
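The complete prediction step can be sketched as follows. The sketch assumes the score form T/(T+O), with T the product of the target column and O the product, over feature types, of the largest value among the non-target columns in each row. This is an interpretation of the proposed rule as described in the text, not a verbatim copy of Equations 21 and 22, and the function names and matrix values are hypothetical.

```python
from math import prod

def proposed_scores(D_tilde):
    """Confidence scores under the proposed rule: for each feature type (row),
    only the largest outlier value enters the outlier product."""
    m = len(D_tilde[0])
    scores = []
    for k in range(m):
        target = prod(row[k] for row in D_tilde)
        outlier = prod(max(row[i] for i in range(m) if i != k) for row in D_tilde)
        scores.append(target / (target + outlier))
    return scores

def classify(D_tilde):
    """Treat each class as the target in turn and return the best-scoring class."""
    scores = proposed_scores(D_tilde)
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores

# Hypothetical normalized matrix: n = 3 feature types (rows), m = 3 classes (columns).
D_tilde = [[0.9, 0.4, 0.1],
           [0.7, 0.5, 0.2],
           [0.8, 0.3, 0.6]]
label, scores = classify(D_tilde)
```

Because each row contributes exactly one factor to both products, the target and outlier terms always have n factors regardless of the number of classes.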
4 Experiments and results
The effectiveness of the proposed method is illustrated using a biopsy breast cancer image set, a 3D OCT retinal image set, and the UCI Wisconsin breast cancer (diagnostic) dataset. The details of the image sets and image feature extractors are given in Section 4.1. Section 4.2 introduces our experimental setup and the evaluation methods used in our experiments. The effectiveness of combining kernel PCAs is illustrated in Section 4.3. Finally, the proposed method is compared with some state-of-the-art ensemble classification methods on the UCI Wisconsin breast cancer dataset.
4.1 Image set and feature extraction
Three medical datasets were used to evaluate the proposed classification method: a breast cancer benchmark biopsy image dataset from the Israel Institute of Technology [47], a 3D OCT retinal image set, and the breast cancer dataset (diagnostic) from the UCI machine learning repository [48].
4.1.1 Breast cancer biopsy image set
The image set consists of 361 samples, of which 119 were classified by a pathologist as normal tissue, 102 as carcinoma in situ, and 140 as invasive ductal or lobular carcinoma. The samples were generated from breast tissue biopsy slides, stained with hematoxylin and eosin. They were photographed using a Nikon Coolpix® 995 attached to a Nikon Eclipse® E600 (Nikon Corporation, Shinjuku, Tokyo, Japan) at a magnification of ×40 to produce images with a resolution of about 5 μm per pixel. No calibration was made, and the camera was set to automatic exposure. The images were cropped to a region of interest of 760×570 pixels and compressed using lossy JPEG compression. The resulting images were again inspected by a pathologist to ensure that their quality was sufficient for diagnosis. Figure 4 presents three sample images of healthy tissue, tumor in situ, and invasive carcinoma.
The shape feature and texture feature are critical factors for distinguishing one image from another. For the biopsy image discrimination, shapes and textures are also effective. As we can see from Figure 4, the three kinds of biopsy images have visible differences in cell externality and texture distribution. Thus, we use completed local binary patterns (CLBPs) [49] for extracting local textural features, gray level cooccurrence matrix (GLCM) [50] statistics for representing global textures, and the curvelet transform [51] for shape description. These feature descriptors have shown promising results in our previous work on biopsy image classification [52].
Different from traditional LBPs, in CLBPs a local region is represented by three coding operators to represent the central pixel, the difference signs, and the difference magnitudes [49]. According to the authors, CLBP can achieve much better rotation invariant texture classification results than conventional LBPbased schemes. In this paper, we use the 3D joint histogram of these three operators to generate textural features of breast cancer biopsy images, and the joint combination of the three components gives better classification than when using conventional LBPs and provides a smaller feature dimension. The dimension of the CLBP feature is 200.
The co-occurrence probabilities provide a second-order method for generating texture features. The basis for the features used here is the gray level co-occurrence matrix [50]. In this work, a total of 22 features were extracted from the gray level co-occurrence matrix; they are listed in Table 1. Each of these statistics has a qualitative meaning with respect to the structure within the gray level co-occurrence matrix. The total dimension of the GLCM features is 220.
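As a small illustration of how co-occurrence statistics are obtained, the following Python sketch builds a normalised co-occurrence matrix for one pixel offset and computes two of the classical statistics from [50], contrast and energy. The function names, the single offset, and the choice of these two statistics are assumptions made for illustration; the full set of 22 features is the one listed in Table 1.

```python
def glcm(img, levels, dy=0, dx=1):
    """Normalised gray level co-occurrence matrix of an integer image
    (list of rows with values in 0..levels-1) for the offset (dy, dx)."""
    h, w = len(img), len(img[0])
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[img[y][x]][img[y2][x2]] += 1
                total += 1
    return [[c / total for c in row] for row in counts]


def contrast(p):
    """Sum of (i - j)^2 * p(i, j): large when neighbouring levels differ."""
    return sum((i - j) ** 2 * p[i][j] for i in range(len(p)) for j in range(len(p)))


def energy(p):
    """Sum of p(i, j)^2: large when few gray level pairs dominate."""
    return sum(v * v for row in p for v in row)


image = [[0, 0, 1],
         [0, 1, 1],
         [2, 2, 2]]
P = glcm(image, levels=3)
print(contrast(P), energy(P))
```

Repeating the computation for several offsets and concatenating the statistics is a common way to obtain a longer GLCM feature vector.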
The fastest curvelet transform currently available is curvelets via wrapping [51], which was therefore adopted in our work. Statistics can be calculated from the coefficients of each curvelet subband. In this paper, the mean μ, the standard deviation δ, and the entropy H are used as simple features. If n curvelet subbands are used for the transform, 3n features G=[G_{μ},G_{δ},H] are obtained, where G_{μ}=[μ_{1},μ_{2},…,μ_{n}], G_{δ}=[δ_{1},δ_{2},…,δ_{n}], and H=[h_{1},h_{2},…,h_{n}]. A 3n-dimensional feature vector can thus be used to represent each image in the dataset. Using five levels of the curvelet transform, 82 subbands of curvelet coefficients are computed, so a 246-dimensional (3×82) curvelet feature vector is generated for each image.
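The construction of the 3n-dimensional vector can be sketched in Python as follows. The sketch assumes that the coefficients of each subband are already available as a flat list of real numbers (for example, coefficient magnitudes) and that the entropy is the Shannon entropy of a fixed-bin histogram; the text does not specify the entropy estimator, so the 16-bin histogram and the names `subband_stats` and `curvelet_feature_vector` are illustrative assumptions.

```python
import math


def subband_stats(coeffs, bins=16):
    """Mean, standard deviation, and histogram entropy (in bits) of one subband."""
    n = len(coeffs)
    mean = sum(coeffs) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in coeffs) / n)
    lo, hi = min(coeffs), max(coeffs)
    width = (hi - lo) / bins or 1.0  # avoid a zero bin width for constant subbands
    counts = [0] * bins
    for c in coeffs:
        counts[min(int((c - lo) / width), bins - 1)] += 1
    entropy = -sum((k / n) * math.log2(k / n) for k in counts if k)
    return mean, std, entropy


def curvelet_feature_vector(subbands):
    """Concatenate [means, standard deviations, entropies] over all subbands."""
    stats = [subband_stats(s) for s in subbands]
    return ([s[0] for s in stats]
            + [s[1] for s in stats]
            + [s[2] for s in stats])
```

With 82 subbands the returned list has 3 × 82 = 246 entries, ordered as G_{μ}, then G_{δ}, then H.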
4.1.2 3D OCT retinal image set
The 3D OCT retinal image set was collected at the Royal Hospital of University of Liverpool. The image set contains 140 volumetric OCT images, of which 68 are from normal eyes and the remaining 72 are from eyes that have age-related macular degeneration (AMD). Figure 5 shows example images.
The OCT images are preprocessed using the Split Bregman Isotropic Total Variation algorithm with a least squares approach [53]. The preprocessing step has two targets: (i) identification and extraction of a volume of interest (VOI), which also results in noise removal, and (ii) flattening of the retina as appropriate. Example images after preprocessing are shown in Figure 6.
As the images are three-dimensional, following the work in [53], three types of image features were used for image description: local binary patterns of three orthogonal planes (LBP-TOP), local phase quantization (LPQ), and multiscale spatial pyramid (MSSP).
4.1.3 UCI breast cancer dataset
The Wisconsin breast cancer data were obtained from digitized images of fine needle aspirates (FNA) of breast masses. The features describe characteristics of the cell nuclei present in each image. Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The 569 images in the dataset are categorized into two classes: benign and malignant.
4.2 Experimental setup and performance evaluation methods
MATLAB 7 was used to implement the proposed process together with the Gaussian kernel k(x,y)=exp(−∥x−y∥^{2}/(2σ^{2})). Other types of kernels could have been used; however, since the Gaussian kernel is commonly used for the kernel PCA, the SVDD, and the Parzen density, it is the only kernel used in the experiments reported here.
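For readers who want to reproduce the kernel outside MATLAB, a minimal Python sketch of the Gaussian kernel and the resulting Gram matrix is given below. The function names are hypothetical, and the code is an illustration rather than the implementation used for the experiments.

```python
import math


def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(X, sigma):
    """Symmetric matrix of pairwise kernel values for a list of vectors."""
    return [[gaussian_kernel(a, b, sigma) for b in X] for a in X]


K = gram_matrix([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]], sigma=4.0)
print(K[0][1])  # exp(-16 / 32) = exp(-0.5)
```

The diagonal of the Gram matrix is always 1 for this kernel, and larger values of σ make distant samples look more similar.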
Unless otherwise stated, tenfold cross-validation was used, and all results are averages over ten runs of the tenfold cross-validation. The following measures are used to evaluate the proposed method:

Recognition rate (RR) = number of correctly recognized images / number of testing images

ROC, receiver operating characteristic graph

AUC, area under an ROC curve
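The evaluation protocol can be sketched in Python as follows: ten repetitions of tenfold cross-validation yield 100 train/test splits, and the recognition rate is computed on each test fold and then averaged. The sketch assumes plain (non-stratified) random folds, since the text does not say whether the folds were stratified; the function names are hypothetical.

```python
import random


def recognition_rate(y_true, y_pred):
    """RR = number of correctly recognized images / number of testing images."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


splits = list(repeated_kfold(361))
print(len(splits))  # 100 splits for ten runs of tenfold cross-validation
```

Within each run every sample appears in exactly one test fold, so averaging the per-fold recognition rates over all 100 splits uses each image ten times as a test sample.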
4.3 Evaluation of kernel PCA ensemble
This section reports the evaluation of the KPCA ensemble on the biopsy image data and the 3D OCT retinal image data. For the biopsy images, three types of image features were extracted (Section 4.1), so three KPCA models were built for each image class, one per feature type. The recognition rates obtained with these KPCAs individually are listed in columns 2 to 4 of Table 2, where CvletK, GLCMK, and LBPK denote the KPCA models trained from the curvelet, GLCM, and LBP features, respectively. The results of combining all KPCA models are listed in the last column of Table 2; the combining rule is introduced in Equation 21. The parameters of the KPCAs were set to σ=4 and n=40. The combined model gives the best classification performance for each image class; the classification accuracy averaged over the three image classes is 92.28%.
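To make the role of the two parameters concrete, the following Python sketch fits a one-class kernel PCA model on the samples of a single class and scores new samples by their reconstruction error in feature space, in the spirit of kernel PCA novelty detection [29]. It is a sketch under assumptions: the class name `OneClassKPCA`, the use of NumPy, and the choice of the reconstruction error as the score are illustrative and should be checked against the model equations given earlier in the paper. The defaults mirror the reported σ=4 and n=40.

```python
import numpy as np


def gaussian_gram(A, B, sigma):
    """Pairwise Gaussian kernel values between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


class OneClassKPCA:
    def __init__(self, sigma=4.0, n_components=40):
        self.sigma = sigma
        self.n_components = n_components

    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        K = gaussian_gram(self.X, self.X, self.sigma)
        self.col_mean = K.mean(axis=0)
        self.all_mean = K.mean()
        # Centre the Gram matrix in feature space.
        Kc = K - self.col_mean[None, :] - self.col_mean[:, None] + self.all_mean
        w, V = np.linalg.eigh(Kc)
        order = np.argsort(w)[::-1][: self.n_components]
        w, V = w[order], V[:, order]
        keep = w > 1e-10  # drop numerically null directions
        # Scale so that each principal axis has unit norm in feature space.
        self.alpha = V[:, keep] / np.sqrt(w[keep])
        return self

    def reconstruction_error(self, Z):
        Z = np.asarray(Z, dtype=float)
        Kz = gaussian_gram(Z, self.X, self.sigma)
        row_mean = Kz.mean(axis=1)
        # Squared distance to the training mean in feature space (k(z, z) = 1).
        potential = 1.0 - 2.0 * row_mean + self.all_mean
        Kz_c = Kz - row_mean[:, None] - self.col_mean[None, :] + self.all_mean
        proj = Kz_c @ self.alpha
        return potential - (proj ** 2).sum(axis=1)
```

A small error means the sample is well described by the principal subspace of the class; one such model per class and per feature type gives the set of scores that a combining rule then merges.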
The evaluation results on the 3D OCT retinal images are listed in Table 3. Three types of image features were extracted, namely LPQ, LBP-TOP, and MSSP, so three KPCA models were built for each image class, one per feature type. The recognition rates obtained with these KPCAs individually are listed in columns 2 to 4 of Table 3, where LPQK, LBPK, and MSSPK denote the KPCA models trained from LPQ, LBP-TOP, and MSSP, respectively. The results of combining all KPCA models are listed in the last column of Table 3. The parameters of the KPCAs were set to σ=4 and n=40. The combined model gives the best classification performance for each image class; the classification accuracy averaged over the two image classes is 92.06%.
From Tables 2 and 3, one can see that the proposed product combining rule improves the classification accuracy for every image class. This illustrates that combining one-class classifiers trained from different features can improve classification performance, which is in accordance with the observation in [30]. For comparison, other one-class classifiers were also used as the base classifier of the ensemble with the same combining rule; the classification results on the biopsy image set and the 3D OCT retinal image set are listed in Tables 4 and 5, respectively.
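The idea of combining per-feature one-class models by a product can be illustrated with the generic product rule from the classifier combination literature [45]. The Python sketch below is not Equation 21 of this paper: it assumes that each one-class model has already been converted into a confidence in [0, 1] for its class, multiplies the confidences across feature types, and normalises across classes. The function name and the example numbers are hypothetical.

```python
import math


def product_rule(scores):
    """Combine confidences with a generic product rule.

    scores[c][f] is the confidence in [0, 1] that the sample belongs to
    class c according to the one-class model trained on feature type f.
    Returns one normalised confidence per class.
    """
    products = [math.prod(per_feature) for per_feature in scores]
    total = sum(products)
    if total == 0.0:
        # Every class was vetoed by at least one model: fall back to uniform.
        return [1.0 / len(scores)] * len(scores)
    return [p / total for p in products]


# Three classes, three feature types (illustrative numbers only).
confidences = product_rule([[0.9, 0.8, 0.7],
                            [0.2, 0.3, 0.4],
                            [0.1, 0.1, 0.1]])
predicted_class = confidences.index(max(confidences))
```

A product rewards classes on which all feature types agree, and a single very low confidence strongly penalises a class, which is why the rule benefits from complementary features.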
To compare a variety of one-class classifiers, six were used as the base classifier of the ensemble: Parzen, SVDD, PCA, K-means, MoG, and KPCA. The receiver operating characteristic (ROC) curves obtained using the different one-class classifiers on the biopsy image data are shown in Figure 7. The x axis of the ROC curves is the false positive rate (FPr) and the y axis is the true positive rate (TPr). The FPr and TPr are obtained by Equations 23 and 24, respectively. A threshold on the difference between the largest and the second largest confidence scores was used to obtain the tradeoff between TPr and FPr. Initially, the threshold was set to 0.05; it was then increased in steps of 0.01 up to 0.60, and the TPr and FPr were computed at each threshold value. The areas under the ROC curves (AUC) for the compared classifiers are listed in Table 6; the KPCA ensemble gives the best result.
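The threshold sweep can be sketched in Python as follows. Because Equations 23 and 24 are defined elsewhere in the paper, the sketch makes its own assumption about the rates: a decision is accepted when the margin between the two largest confidence scores reaches the threshold, TPr is the accepted fraction of correct decisions, and FPr is the accepted fraction of incorrect decisions. The function names are hypothetical, and the AUC is computed with the trapezoidal rule.

```python
def margin(scores):
    """Difference between the largest and the second largest confidence."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1]


def roc_points(score_rows, labels, thresholds):
    """(FPr, TPr) per threshold under the accept/reject assumption above."""
    decisions = []
    for scores, label in zip(score_rows, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        decisions.append((margin(scores), predicted == label))
    n_correct = sum(ok for _, ok in decisions)
    n_wrong = len(decisions) - n_correct
    points = []
    for t in thresholds:
        tp = sum(1 for m, ok in decisions if ok and m >= t)
        fp = sum(1 for m, ok in decisions if not ok and m >= t)
        points.append((fp / n_wrong if n_wrong else 0.0,
                       tp / n_correct if n_correct else 0.0))
    return points


def auc(points):
    """Trapezoidal area under a set of (FPr, TPr) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Thresholds 0.05, 0.06, ..., 0.60 as described in the text (56 values).
thresholds = [round(0.05 + 0.01 * i, 2) for i in range(56)]
```

Since the sweep stops at 0.60 and starts at 0.05, the resulting curve does not necessarily reach the corners (0, 0) and (1, 1); whether those end points are added before integrating affects the reported AUC.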
The proposed method was also compared with some state-of-the-art methods on the biopsy image set. The compared methods are as follows: (i) the level set histogram (LSH) method proposed in [54]; (ii) a cascade classification system (CAS) in [55], which first classifies the images into ‘cancer’ and ‘non-cancer’ categories and then performs a further classification within the ‘cancer’ category to discriminate different cancer types; (iii) a hybrid feature (HF) method proposed in [56], which uses higher-order spectra (HOS), local binary patterns (LBP), and Laws texture energy (LTE) for histopathological image classification, with the Takagi-Sugeno fuzzy model as the classifier.
In our experiment, following the description in [54], for LSH the images were first converted to grayscale images with an intensity range between 0 and 255; then 25 thresholds with a step of 10 were used to convert the images into binary images (0 and 1). For each binary image, level set segmentation was used to generate a 42-bin histogram for the connected components in the image. Thus, each image finally generated a feature vector of size 42×25=1,050. An SVM with an RBF kernel was used for classification, with the parameter σ that defines the spread of the radial function set to 4.0 and the parameter C that defines the tradeoff between the classifier accuracy and the margin set to 3.0. For CAS, we used the same classifier, decision tree C5.0, and the same image features as stated in [55]. The feature vector for each image is a combination of first-order statistics, co-occurrence matrix features, and steerable filter responses.
Table 7 lists the performance of the compared methods on the biopsy image set; the proposed method achieved better performance than the other methods. The CAS method obtained an accuracy of 91.94%, which is superior to the accuracy of LSH and HF. The LSH method obtained only 87.38% accuracy on the biopsy image set. LSH used only the level set histograms for image description, while the other compared methods all used composite image features, which suggests that using a combination of different image features can improve classification performance. Figure 8 presents the ROC curves of the compared methods; the AUCs of the ROC curves are listed in Table 7.
For the 3D OCT retinal images, the proposed method was compared with the method in [53]. That method used the same image data, and the image features introduced in Section 4.1.2 were concatenated into a single image feature, with a Bayes classifier used for classification. A classification accuracy of 91.50% was reported by the authors, while our proposed system achieved 92.06%.
The proposed method was also compared with some state-of-the-art methods on the UCI breast cancer dataset. The compared methods are the following: (i) the multilayer perceptron ensemble (MLPE) method proposed in [57]; (ii) a boosted neural network (BoostNN) classifier in [58]; (iii) a decision tree (DT) and support vector machine sequential minimal optimization (SVM-SMO) based ensemble classifier proposed by Luo and Cheng [59]. The results are listed in Table 8.
5 Conclusions
In this paper, a classification scheme based on a one-class KPCA model ensemble has been proposed for the classification of medical images. The ensemble consists of one-class KPCA models trained using different image features from each image class, and a proposed product combining rule was used for combining the kernel PCA models to produce classification confidence scores for assigning an image to each class. The effectiveness of the proposed classification scheme was verified using a breast cancer biopsy image dataset and a 3D OCT retinal image set. The proposed classification scheme obtained high classification accuracy on the tested image sets.
Although the proposed system has shown promising results on the biopsy image classification task, some aspects still need further investigation. The benchmark images used in this work were cropped from the original biopsy scans and cover only the important areas of the scans. However, it is often difficult to find regions of interest (ROIs) that contain the most important tissues in biopsy scans; therefore, more effort needs to be put into detecting ROIs in biopsy images. The parameters of the kernel PCA models, such as the number of principal components and the width of the Gaussian kernel, were fixed during the experiments. In future research, optimization methods or adaptive algorithms should be considered for searching for the optimal parameters of the KPCA models.
References
1. Boucheron LE: Object- and spatial-level quantitative analysis of multispectral histopathology images for detection and characterization of cancer. Thesis, University of California Santa Barbara, 2008
2. Loukas C: A survey on histological image analysis-based assessment of three major biological factors influencing radiotherapy: proliferation, hypoxia and vasculature. Comput. Methods Programs Biomed 2004, 74(3):183–199. 10.1016/j.cmpb.2003.07.001
3. Orlov N, Shamir L, Macura T, Johnston J, Eckley DM, Goldberg IG: WND-CHARM: multi-purpose image classification using compound image transforms. Pattern Recognit. Lett 2008, 29(11):1684–1693. 10.1016/j.patrec.2008.04.013
4. Kuncheva L, Rodriguez J, Plumpton C, Linden D, Johnston S: Random subspace ensembles for FMRI classification. IEEE Trans. Med. Imaging 2010, 29(2):531–542.
5. Tax D: One-class classification. Thesis, Delft University of Technology, 2001
6. Rokach L: Ensemble-based classifiers. Artif. Intell. Rev 2010, 33: 1–39. 10.1007/s10462-009-9124-7
7. Goh KS, Chang EY, Li B: Using one-class and two-class SVMs for multiclass image annotation. IEEE Trans. Knowl. Data Eng 2005, 17(10):1333–1346.
8. Bergamini C, Oliveira L, Koerich A, Sabourin R: Combining different biometric traits with one-class classification. Signal Process 2009, 89: 2117–2127. 10.1016/j.sigpro.2009.04.043
9. Haghighi MS, Vahedian A, Yazdi HS: Creating and measuring diversity in multiple classifier systems using support vector data description. Appl. Soft Comput 2011, 11: 4931–4942. 10.1016/j.asoc.2011.06.006
10. Bryll R, Gutierrez-Osuna R, Quek F: Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognit 2003, 36: 1291–1302. 10.1016/S0031-3203(02)00121-8
11. Kuncheva L, Jain LC: Designing classifier fusion systems by genetic algorithms. IEEE Trans. Evol. Comput 2000, 4(4):327–336. 10.1109/4235.887233
12. Zhang L, Zhang L: On combining multiple features for hyperspectral remote sensing image classification. IEEE Trans. Geoscience Remote Sensing 2012, 50(3):879–893.
13. Yu J, Lin F, Seah HS, Li C, Lin Z: Image classification by multimodal subspace learning. Pattern Recognit. Lett 2012, 33: 1196–1204. 10.1016/j.patrec.2012.02.002
14. Moya M, Koch M, Hostetler L: One-class classifier networks for target recognition applications. In Proceedings of World Congress on Neural Networks. Portland; July 1993:797–801.
15. Khan SS, Madden MG: A survey of recent trends in one class classification. In Artificial Intelligence and Cognitive Science, Lecture Notes in Computer Science, vol. 6206. Edited by: Coyle L, Freyne J. Berlin, Heidelberg: Springer; 2010:188–197.
16. Markou M, Singh S: Novelty detection: a review - part 1: statistical approaches. Signal Process 2003, 83: 2481–2497. 10.1016/j.sigpro.2003.07.018
17. Markou M, Singh S: Novelty detection: a review - part 2: neural network based approaches. Signal Process 2003, 83: 2499–2521. 10.1016/j.sigpro.2003.07.019
18. Tax DM, Duin RP: Support vector domain description. Pattern Recognit. Lett 1999, 20: 1191–1199. 10.1016/S0167-8655(99)00087-2
19. Tax DM, Duin RP: Support vector data description. Mach. Learn 2004, 54: 45–66.
20. Schölkopf B, Platt J, Shawe-Taylor J, Smola A, Williamson RC: Estimating the support of a high dimensional distribution. Neural Comput 2001, 13(7):1443–1472. 10.1162/089976601750264965
21. Manevitz LM, Yousef M: One-class SVMs for document classification. J. Mach. Learn. Res 2001, 2: 139–154.
22. Lewis DD: Test collections - Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578. Accessed 22 June 2013
23. Roth V: Kernel fisher discriminants for outlier detection. Neural Comput 2006, 18: 942–960. 10.1162/neco.2006.18.4.942
24. Ridder D, Tax D, Duin D: An experimental comparison of one-class classification methods. In Proceedings of the 4th Annual Conference of the Advanced School for Computing and Imaging. Holland: Delft; 1998:213–218.
25. Wang Q, Lopes L, Tax D: Visual object recognition through one-class learning. In International Conference on Image Analysis and Recognition, Porto, Portugal. Springer, Berlin; 2004:463–470.
26. Beyer K, Goldstein J, Ramakrishnan R, Shaft U: When is ‘nearest neighbor’ meaningful? Lect. Notes Comput. Sci 1999, 540: 217–235.
27. Jolliffe IT: Principal Component Analysis. New York: Springer; 1986.
28. Zhang H, Huang W, Huang Z, Zhang B: A kernel autoassociator approach to pattern classification. IEEE Trans. Syst., Man Cybernetics-Part B: Cybern 2005, 35(3):593–606. 10.1109/TSMCB.2005.843980
29. Hoffmann H: Kernel PCA for novelty detection. Pattern Recognit 2007, 40: 863–874. 10.1016/j.patcog.2006.07.009
30. Tax DM, Duin RP: Combining one-class classifiers. In Proceedings of Multiple Classifier Systems. Berlin: Springer; 2001:299–308.
31. Shieh AD, Kamm DF: Ensembles of one class support vector machines. In Proceedings of the Multiple Classifier Systems. Berlin: Springer; 2009:181–190.
32. Jain AK, Duin RPW, Mao J: Statistical pattern recognition: a review. IEEE Trans. Pattern Anal. Mach. Intell 2000, 22(1):4–37. 10.1109/34.824819
33. Perdisci R, Gu G: Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2006). Piscataway: IEEE Computer Society; 2006:488–498.
34. Krawczyk B: Diversity in ensembles for one-class classification. In Advances in Intelligent Systems and Computing, New trends in databases and information systems, vol. 185. Edited by: Pechenizkiy M, Wojciechowski M. Berlin, Heidelberg: Springer; 2013:119–129.
35. Yang P, Yang YH, Zhou BB, Zomaya AY: A review of ensemble methods in bioinformatics. Curr. Bioinformatics 2010, 5(4):296–308. 10.2174/157489310794072508
36. Li P, Chan KL, Krishnan SM: Learning a multi-size patch-based hybrid kernel machine ensemble for abnormal region detection in colonoscopic images. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR 2005). Piscataway: IEEE Computer Society; 2005:670–675.
37. Li P, Chan KL, Fu S, Krishnan SM: An abnormal ECG beat detector approach for long-term monitoring of heart patients based on hybrid kernel machine ensemble. In Proceedings of the International Workshop on Multiple Classifier Systems (MCS 2005). Heidelberg: Springer; 2005:346–355.
38. Okun O, Priisalu H: Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artif. Intell. Med 2009, 45: 151–162. 10.1016/j.artmed.2008.08.004
39. Schölkopf B: The kernel trick for distances. Technical report MSR-TR-2000-51, Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052 (2000)
40. Kallas M, Honeine P, Richard C, Francis C, Amoud H: Non-negativity constraints on the pre-image for pattern recognition with kernel machines. Pattern Recognit 2013, 46: 3066–3080. 10.1016/j.patcog.2013.03.021
41. Mika S, Schölkopf B, Smola A, Müller KR, Scholz M, Rätsch G: Kernel PCA and denoising in feature spaces. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II. Cambridge: MIT Press; 1998:536–542.
42. Kwok JTY, Tsang IWH: The pre-image problem in kernel methods. IEEE Trans. Neural Netw 2004, 15(6):1517–1525. 10.1109/TNN.2004.837781
43. Zheng WS, Lai J, Yuen PC: Penalized preimage learning in kernel principal component analysis. IEEE Trans. Neural Netw 2010, 21(4):551–570.
44. Williams C: On a connection between kernel PCA and metric multidimensional scaling. In Advances in Neural Information Processing Systems 13, NIPS 2001. Cambridge: MIT Press; 2001:675–681.
45. Kittler J, Hatef M, Duin RP, Matas J: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell 1998, 20(3):226–239. 10.1109/34.667881
46. Kuncheva LI: Combining Pattern Classifiers: Methods and Algorithms. New York: Wiley; 2004.
47. Breast cancer data. ftp://ftp.cs.technion.ac.il/pub/projects/medicimage. Accessed 22 June 2013
48. UCI: Machine learning repository. http://archive.ics.uci.edu/ml/datasets/. Accessed 22 June 2013
49. Guo Z, Zhang L, Zhang D: A completed modeling of local binary pattern operator for texture classification. IEEE Trans. Image Process 2010, 19(6):1657–1663.
50. Haralick R, Shanmugam K, Dinstein I: Textural features for image classification. IEEE Trans. Syst., Man Cybern 1973, 3(6):610–621.
51. Candes E, Demanet L, Donoho D, Ying L: Fast discrete curvelet transforms. Multiscale Model. Simul 2006, 5: 861–899. 10.1137/05064182X
52. Zhang Y, Zhang B, Coenen F, Lu W: Breast cancer diagnosis from biopsy images with highly reliable random subspace classifier ensembles. Mach. Vis. Appl 2012, 1–17. doi:10.1007/s00138-012-0459-8
53. Albarrak A, Coenen F, Zheng Y: Age-related macular degeneration identification in volumetric optical coherence tomography using decomposition and local feature extraction. In Proceedings of 2013 International Conference on Medical Image, Understanding and Analysis. University of Birmingham; 17–19 July 2013:59–64.
54. Brook A, El-Yaniv R, Isler E, Kimmel R, Meir R, Peleg D: Breast cancer diagnosis from biopsy images using generic features and SVMs. Technical report CS-2008-07, Technion-Israel Institute of Technology, Technion City, Haifa 32000, Israel (2006)
55. Doyle S, Feldman MD, Shih N, Tomaszewski J, Madabhushi A: Cascaded discrimination of normal, abnormal, and confounder classes in histopathology: Gleason grading of prostate cancer. BMC Bioinformatics 2012, 13(282):1–15.
56. Krishnan MMR, Venkatraghavan V, Acharya UR, Pal M, Paul RR, Min LC, Ray AK, Chatterjee J, Chakraborty C: Automated oral cancer identification using histopathological images: a hybrid feature extraction paradigm. BMC Bioinformatics 2012, 13(282):1–15.
57. Valdovinos R, Sanchez J: Performance analysis of classifier ensembles: neural networks versus nearest neighbor rule. Pattern Recognit Image Anal. (Lecture Notes in Computer Science) 2007, 4477: 105–112. 10.1007/978-3-540-72847-4_15
58. Gou S, Yang H, Jiao L, Zhuang X: Algorithm of partition based network boosting for imbalanced data classification. In Proceedings of the 2010 International Joint Conference on Neural Networks, IJCNN’10. Piscataway: IEEE; 2010:1–6.
59. Luo S, Cheng B: Diagnosing breast masses in digital mammography using feature selection and ensemble methods. J. Med. Syst 2012, 36(2):569–577. 10.1007/s10916-010-9518-8
Acknowledgements
The project is funded by Natural Science Foundation China grants 61262070 and EIN2011A001 and China Yunnan Provincial Natural Science Foundation grant 2010CD047.
Additional information
Competing interests
The authors declare that they have no competing interests.
An erratum to this article can be found at http://dx.doi.org/10.1186/s13634-015-0274-2.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Zhang, Y., Zhang, B., Coenen, F. et al. One-class kernel subspace ensemble for medical image classification. EURASIP J. Adv. Signal Process. 2014, 17 (2014). https://doi.org/10.1186/1687-6180-2014-17
Keywords
 Breast cancer diagnosis
 Biopsy image
 One-class classifier
 Kernel principal component analysis
 Classifier ensemble