  • Research Article
  • Open Access

Small-Sample Error Estimation for Bagged Classification Rules

EURASIP Journal on Advances in Signal Processing 2010, 2010:548906

https://doi.org/10.1155/2010/548906

  • Received: 2 April 2010
  • Accepted: 16 July 2010
  • Published:

Abstract

Application of ensemble classification rules in genomics and proteomics has become increasingly common. However, the problem of error estimation for these classification rules, particularly for bagging under the small-sample settings prevalent in genomics and proteomics, is not well understood. Breiman proposed the "out-of-bag" method for estimating statistics of bagged classifiers, which was subsequently applied by other authors to estimate the classification error. In this paper, we give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, by formulating carefully how the error count is normalized. We also report the results of an extensive simulation study of bagging of common classification rules, including LDA, 3NN, and CART, applied on both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, semi-bolstering, in addition to the out-of-bag estimator. The results from the numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies.

Keywords

  • Root Mean Square
  • Linear Discriminant Analysis
  • Error Estimator
  • Classification Rule
  • Ensemble Classifier

1. Introduction

Ensemble classification methods combine the decisions of multiple classifiers designed on randomly perturbed versions of the available data [1–5]. The most popular version of this scheme is known as bootstrap aggregating, or "bagging" [4, 5], where the ensemble classifier corresponds to a majority vote among classifiers designed on bootstrap samples [6] from the available training data.

There has been considerable interest recently in the application of bagging in the classification of both gene expression data [7–10] and protein-abundance mass spectrometry data [11–16]. The popularity of bagging is based on the expectation that combining the decisions of several classifiers will regularize and improve the performance of unstable, overfitting classification rules (the so-called "weak learners"). In a related study [17], the authors investigated this claim in the context of small-sample genomics and proteomics data. A different issue is the performance of error estimators for bagged classifiers. Accurate error estimation is a critical issue in genomics, as it decisively impacts the scientific validity of hypotheses derived from the application of pattern recognition methods to biomedical data [18–20]. On the topic of error estimation, Breiman proposed a general method, which he called "out-of-bag", for estimating statistics of bagged classifiers [21], and, subsequently, other authors applied it to the estimation of the classification error [22, 23]. In this paper, we give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias, which is done by formulating carefully how the error count is normalized. The performance of out-of-bag estimators with general bagged classification rules is not in fact well understood, especially in connection with bagging ensemble classifiers derived from classification rules other than decision trees (which were Breiman's primary interest). In addition, to our knowledge, no studies have attempted to assess the performance of error estimators for bagged classifiers in the context of genomics data, particularly in the small-sample setting prevalent in these applications.

To investigate these issues, we conducted an extensive simulation study of bagging of common classification rules, including LDA, 3NN, and CART, applied on both synthetic and real patient data, corresponding to the use of common error estimators such as resubstitution, leave-one-out, cross-validation, basic bootstrap, bootstrap 632, bootstrap 632 plus, bolstering, semibolstering, in addition to the out-of-bag estimator itself. We present here selected representative results; the full set of results can be found on the companion website, at http://gsp.tamu.edu/Publications/supplementary/oob. The results from the numerical experiments indicated that the performance of the out-of-bag error estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is for the most part consistent with their performance with the corresponding single classification rules assessed in other studies, with the best performance being provided by the bolstered error estimators, in terms of root mean square error.

This paper is organized as follows. In Section 2, we review briefly the definition of bagged ensemble classification rules. In Section 3, we describe the error estimators considered in this study. In Section 4, we present the results of a large simulation study on the performance of error estimators with bagged classification rules. Finally, Section 5 provides concluding remarks.

2. Bagged Classification Rules

In pattern recognition, classification is the process of assigning a group label to an object, based on information available about it in the form of a data vector called a feature vector. Suppose we have a binary classification problem with feature vector $X$ in a feature space $\mathbb{R}^p$ and label $Y \in \{0, 1\}$. A classifier is a function $\psi : \mathbb{R}^p \to \{0, 1\}$. The stochastic properties of the classification problem are completely determined by the joint feature-label distribution $F_{XY}$ of the pair $(X, Y)$. In practice, $F_{XY}$ is rarely known. Classification is implemented empirically, by means of the design of a classifier based on a finite set of $n$ i.i.d. samples drawn from $F_{XY}$:
$$S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}.$$ (1)
For a fixed $n$, a classification rule is a function $\Psi_n$ that maps the sample data to a classifier:
$$\Psi_n : S_n \mapsto \psi_n.$$ (2)
For a given training set $S_n$, we have a designed classifier $\psi_n = \Psi_n(S_n)$. The classification error is the probability of incorrectly classifying a future sample $(X, Y)$ given the training sample set $S_n$:
$$\epsilon_n = P(\psi_n(X) \neq Y \mid S_n).$$ (3)

It is clear that the classification error $\epsilon_n$ is random, as it depends on the training set $S_n$. Its expectation over the randomness of $S_n$, $E[\epsilon_n]$, is called the expected classification error; this is a deterministic quantity that is a function of the classification rule and the joint feature-label distribution.

The number of training samples is, in practice, always limited. Much effort is spent on exploiting and reusing the samples as much as possible. Randomization is one resampling technique in which multiple bootstrap sets $S^*$ are created by randomly drawing points from the training set $S_n$, either with or without replacement, corresponding to a resampling distribution on the training data. The cardinality of $S^*$ can be smaller than, equal to, or larger than $n$, depending on the application of interest. In a bootstrap set, a sample point can appear multiple times or not at all. In bagging, different choices of the resampling distribution and of the bootstrap set size lead to variants, but the most common one is uniform resampling with $|S^*| = n$.
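As a small worked illustration of "multiple times or not at all", the sketch below computes the expected fraction of distinct original points that land in one bootstrap set. It assumes uniform resampling with replacement and a bootstrap set of the same size n as the training set; the function name is hypothetical and the sample sizes are only examples.

```python
import math


def expected_distinct_fraction(n: int) -> float:
    """Expected fraction of the n original points that appear at least once
    in a bootstrap set of size n drawn uniformly with replacement.

    A given point is missed by one draw with probability 1 - 1/n, hence by
    all n independent draws with probability (1 - 1/n) ** n.
    """
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    for n in (20, 40, 60):
        print(n, round(expected_distinct_fraction(n), 4))
    # Large-n limit of the same quantity: 1 - 1/e
    print("limit", round(1.0 - math.exp(-1.0), 4))
```

The limit 1 - 1/e is where the familiar "632" figure in bootstrap error estimation comes from.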

An ensemble classifier is obtained by majority voting among component classifiers. Each component of the ensemble is designed on a bootstrap set $S^*$ using the original classification rule $\Psi_n$. The bagged classifier is defined as
$$\psi_n^{\mathrm{bag}}(x) = \begin{cases} 1, & E[\Psi_n(S^*)(x) \mid S_n] > \tfrac{1}{2}, \\ 0, & \text{otherwise}, \end{cases}$$ (4)
where the expectation is taken with respect to the random resampling mechanism, with the training set fixed at the observed value of $S_n$. Bagging is a version of ensemble classifier in which the expectation in (4) is approximated by Monte-Carlo sampling:
$$\psi_{n,m}^{\mathrm{bag}}(x) = \begin{cases} 1, & \frac{1}{m} \sum_{j=1}^{m} \psi^{*j}(x) > \tfrac{1}{2}, \\ 0, & \text{otherwise}, \end{cases}$$ (5)

where the classifiers $\psi^{*j} = \Psi_n(S^{*j})$ are designed by the original classification rule on bootstrap samples $S^{*j}$, for $j = 1, \ldots, m$, for large enough $m$. How large $m$ should be is an important question in bagging: it must be large enough for the Monte Carlo approximation to be accurate, yet small enough to remain computationally efficient. In this paper, we chose $m$ according to the recommendation of Breiman [21] and to our observations on the convergence of the mean error of bagged classifiers in our previous study [17]. It is important to select an odd $m$ to avoid the issue of tie breaking in the majority vote. Experimental results in our previous study [17] showed that increasing $m$ beyond this value leads to negligible differences in performance.
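The Monte Carlo majority vote described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-mean rule is a toy stand-in for the base rules used in the paper, labels are assumed to be 0 or 1, resampling is assumed uniform with replacement, and the default number of components (25) is an arbitrary odd value chosen only for the example.

```python
import random


def nearest_mean_rule(sample):
    """Toy classification rule (hypothetical): assign x to the class whose
    sample mean is closest. `sample` is a list of (features, label) pairs."""
    means = {}
    for label in (0, 1):
        pts = [x for x, y in sample if y == label]
        if pts:  # a bootstrap set may miss one class entirely
            dim = len(pts[0])
            means[label] = [sum(p[k] for p in pts) / len(pts) for k in range(dim)]

    def classify(x):
        # Only classes present in the sample can be predicted.
        return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

    return classify


def bagged_classifier(sample, rule, m=25, rng=None):
    """Design m component classifiers on bootstrap sets and return their
    majority vote. m should be odd so that the vote cannot tie."""
    rng = rng or random.Random(0)
    n = len(sample)
    components = []
    for _ in range(m):
        bootstrap = [sample[rng.randrange(n)] for _ in range(n)]
        components.append(rule(bootstrap))

    def classify(x):
        votes_for_1 = sum(c(x) for c in components)
        return 1 if 2 * votes_for_1 > m else 0

    return classify


if __name__ == "__main__":
    data = [((-2.0, 0.0), 0), ((-1.5, 0.5), 0), ((-2.5, -0.5), 0),
            ((2.0, 0.0), 1), ((1.5, -0.5), 1), ((2.5, 0.5), 1)]
    bag = bagged_classifier(data, nearest_mean_rule)
    print(bag((-2.0, 0.0)), bag((2.0, 0.0)))
```

Any function that maps a list of labeled points to a classifier can be passed as `rule`, which mirrors the way the bagged rule is built on top of an arbitrary base classification rule.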

3. Error Estimation

3.1. Classical Methods

Data in practice are often limited, and the training sample $S_n$ has to be used both for designing the classifier $\psi_n$ and for estimating its true error $\epsilon_n$. An obvious method to estimate $\epsilon_n$ is to use $S_n$ itself as the test set, which often, but not always, leads to optimistic bias. This is called the resubstitution estimator:
$$\hat{\epsilon}^{\,\mathrm{resub}} = \frac{1}{n} \sum_{i=1}^{n} I(\psi_n(X_i) \neq Y_i),$$ (6)
where $I(\cdot)$ denotes the indicator function. In $k$-fold cross-validation, $S_n$ is partitioned into $k$ folds $S_{(i)}$, for $i = 1, \ldots, k$ (for simplicity, we assume that $k$ divides $n$); each fold is left out of the design process and used as a testing set, and the estimate is the overall proportion of errors committed on all folds [24]:
$$\hat{\epsilon}^{\,\mathrm{cv}(k)} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n/k} I\big(\Psi_{n-n/k}(S_n \setminus S_{(i)})(X_j^{(i)}) \neq Y_j^{(i)}\big),$$ (7)

where $(X_j^{(i)}, Y_j^{(i)})$ is a sample in the $i$th fold. The process may be repeated, where several cross-validated estimates are computed using different partitions of the data into folds, and the results averaged. In leave-one-out estimation, a single observation is left out each time, which corresponds to $n$-fold cross-validation. The leave-one-out estimator is nearly unbiased as an estimator of the expected error $E[\epsilon_n]$.
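The two counting estimators above can be sketched in a few lines. This is an illustrative sketch only: the 1-nearest-neighbor rule is a toy stand-in chosen because it makes the optimism of resubstitution obvious, the folds are formed by simple interleaving rather than by a random partition, and all function names are hypothetical.

```python
def one_nn_rule(sample):
    """Toy rule (hypothetical): 1-nearest-neighbor on (features, label) pairs."""
    def classify(x):
        _, label = min(sample, key=lambda s: sum((a - b) ** 2 for a, b in zip(x, s[0])))
        return label
    return classify


def resubstitution(sample, rule):
    """Error rate of the designed classifier on its own training sample."""
    clf = rule(sample)
    return sum(clf(x) != y for x, y in sample) / len(sample)


def cross_validation(sample, rule, k):
    """k-fold cross-validation; assumes k divides len(sample).
    Fold i holds the points whose index is congruent to i modulo k."""
    n = len(sample)
    errors = 0
    for i in range(k):
        test = sample[i::k]
        train = [s for j, s in enumerate(sample) if j % k != i]
        clf = rule(train)  # the held-out fold plays no part in the design
        errors += sum(clf(x) != y for x, y in test)
    return errors / n


if __name__ == "__main__":
    data = [((0.0,), 0), ((1.0,), 0), ((2.1,), 0),
            ((2.9,), 1), ((4.0,), 1), ((5.0,), 1)]
    print("resubstitution:", resubstitution(data, one_nn_rule))
    print("leave-one-out: ", cross_validation(data, one_nn_rule, k=len(data)))
    print("3-fold CV:     ", cross_validation(data, one_nn_rule, k=3))
```

Setting `k` equal to the sample size gives leave-one-out. On this toy data the 1-NN resubstitution estimate is zero because every point is its own nearest neighbor, while the held-out estimates are not.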

3.2. Bootstrap Error Estimators

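Resampling methodology, as mentioned above in generating ensemble classifiers, can also be used for estimating errors. In fact, bootstrap error estimation was proposed by Efron [25] before its use in bagging. The actual proportion of times a data point $(X_i, Y_i)$ appears in a bootstrap sample $S^*$ can be written as $P_i^* = \frac{1}{n} \sum_{j=1}^{n} I\big((X_j^*, Y_j^*) = (X_i, Y_i)\big)$, where $I(A) = 1$ if the statement $A$ is true, zero otherwise. The basic bootstrap (or "zero bootstrap") is given by
$$\hat{\epsilon}^{\,\mathrm{boot}} = \frac{\sum_{b=1}^{B} \sum_{i=1}^{n} I(P_i^{*b} = 0)\, I\big(\Psi_n(S^{*b})(X_i) \neq Y_i\big)}{\sum_{b=1}^{B} \sum_{i=1}^{n} I(P_i^{*b} = 0)},$$ (8)
with the number of bootstrap samples $B$ being between 25 and 200, as recommended in [25]. Bootstrap 632 is a variant that tries to correct the bias of the basic bootstrap estimator by averaging it with the resubstitution estimator [25]:
$$\hat{\epsilon}^{\,\mathrm{b632}} = 0.368\, \hat{\epsilon}^{\,\mathrm{resub}} + 0.632\, \hat{\epsilon}^{\,\mathrm{boot}}.$$ (9)
Bootstrap 632 plus is another modified version of the bootstrap, proposed in [26], which is intended for highly overfitting classification rules. Bootstrap 632 plus attempts to adaptively find the weights in (9) that offset the effects of overfitting. The weights depend on the relative overfitting rate $\hat{R}$ and the no-information error rate $\hat{\gamma}$. In dichotomous classification, $\hat{R}$ and $\hat{\gamma}$ are estimated from $\hat{p}_1$, the proportion of observed samples belonging to class 1, and $\hat{q}_1$, the proportion of classifier outputs belonging to class 1. The relations are as follows:
$$\hat{\gamma} = \hat{p}_1 (1 - \hat{q}_1) + (1 - \hat{p}_1)\, \hat{q}_1, \qquad \hat{R} = \frac{\hat{\epsilon}^{\,\mathrm{boot}} - \hat{\epsilon}^{\,\mathrm{resub}}}{\hat{\gamma} - \hat{\epsilon}^{\,\mathrm{resub}}}, \qquad \hat{w} = \frac{0.632}{1 - 0.368\, \hat{R}}, \qquad \hat{\epsilon}^{\,\mathrm{b632+}} = (1 - \hat{w})\, \hat{\epsilon}^{\,\mathrm{resub}} + \hat{w}\, \hat{\epsilon}^{\,\mathrm{boot}}.$$ (10)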
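A minimal sketch of the three bootstrap estimators follows. It assumes uniform resampling with replacement, binary labels 0 and 1, and the usual 0.368/0.632 weights; clamping the relative overfitting rate to the interval [0, 1] is a safeguard added here as an assumption, and the function names and the default number of bootstrap samples are hypothetical.

```python
import random


def zero_bootstrap(sample, rule, B=100, rng=None):
    """Basic ("zero") bootstrap: each classifier is tested only on the
    original points that are absent from its own bootstrap sample."""
    rng = rng or random.Random(0)
    n = len(sample)
    errors = tested = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        in_sample = set(idx)
        clf = rule([sample[i] for i in idx])
        for i in range(n):
            if i not in in_sample:
                tested += 1
                errors += clf(sample[i][0]) != sample[i][1]
    return errors / tested if tested else float("nan")


def b632(resub, boot):
    """Fixed-weight average of resubstitution and zero bootstrap."""
    return 0.368 * resub + 0.632 * boot


def b632_plus(resub, boot, p1, q1):
    """Adaptive weighting. p1: observed proportion of class 1 labels;
    q1: proportion of classifier outputs equal to 1."""
    gamma = p1 * (1.0 - q1) + (1.0 - p1) * q1  # no-information error rate
    if gamma > resub:
        rate = (boot - resub) / (gamma - resub)  # relative overfitting rate
        rate = min(1.0, max(0.0, rate))          # clamp: assumption of this sketch
    else:
        rate = 0.0
    weight = 0.632 / (1.0 - 0.368 * rate)
    return (1.0 - weight) * resub + weight * boot


if __name__ == "__main__":
    print(b632(0.10, 0.30))
    print(b632_plus(0.10, 0.30, p1=0.5, q1=0.5))
```

When the relative overfitting rate is zero the adaptive weight reduces to 0.632 and the two estimators coincide; as the rate approaches one, the weight approaches one and the estimate moves toward the zero bootstrap.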

3.3. Bolstered Error Estimators

Bolstered estimation was proposed in [27]. It has shown promising performance for small sample sizes in terms of root mean square error. While their accuracy is comparable to that of bootstrap methods in many cases, bolstered estimators are typically much more computationally efficient than the bootstrap. The main idea of bolstering is to place a kernel, called a "bolstering kernel", at each sample point in order to reduce the variance of counting-based estimation methods (in this paper, we adopt Gaussian bolstering kernels). When the classifier is overfitted, and hence the resubstitution estimate is optimistically biased, bolstering at a misclassified point will increase this bias. Semibolstering is suggested to correct this, by conducting no bolstering at misclassified points. We refer the reader to [27] for the full details (in this paper, we employ the bolstered and semibolstered resubstitution estimators of [27]).

3.4. Out-of-Bag Error Estimators

Breiman [21] originally proposed the out-of-bag method to estimate the generalization error of bagged CART predictors and the node priority probabilities. Bylander [22] later conducted a simulation study comparing out-of-bag and cross-validation for C4.5 tree classifiers and concluded that both are biased. Banfield et al. [23] used out-of-bag in a large simulation study investigating the performance of a variety of ensemble methods. Martínez-Muñoz and Suárez [28], in an attempt to find the optimal number of components of ensembles, employed out-of-bag as the optimization criterion. Despite this, the properties of the out-of-bag estimator remain largely unclear, in particular the issue of bias. We propose in the sequel a modification to the standard out-of-bag estimator that removes nearly all of its bias (as evidenced by the numerical experiments in Section 4).

In bagging, component classifiers are designed on bootstrap sets, each of which contains on average about $1 - e^{-1} \approx 63.2\%$ of the distinct points of the original sample set (for uniform resampling with bootstrap sets of the same size as the original sample). Hence, approximately $36.8\%$ of the data are not used to build a given classifier and are therefore uncorrelated with it. Out-of-bag estimates are obtained by testing the majority voting classifier via those individual classifiers in the ensemble that are uncorrelated with the testing point, that is, those classifiers whose training sets do not contain the testing point. Suppose we resample the original sample set $m$ times, leading to bootstrap sample sets $S^{*1}, \ldots, S^{*m}$. Let $I_i^j = 1$ if sample $(X_i, Y_i)$ appears in the bootstrap sample $S^{*j}$, and $I_i^j = 0$ otherwise, for $i = 1, \ldots, n$ and $j = 1, \ldots, m$. Denote
(11)
for . Notice that is equal to the number of times that sample in class appears across all bootstrap sample sets, while is equal to the number of times that sample in class appears and is misclassified across all bootstrap sample sets. Then the out-of-bag error estimator, as proposed by Breiman in [4], can be written as
(12)
The estimator, as formulated above, will in general be optimistically biased, for the following reason. When the $i$th sample point belongs to all of the bootstrap samples, there are no individual classifiers to test on the $i$th point. In other words, the "out-of-bag ensemble" of classifiers for that point is empty in this case. This means that, with a training sample of size $n$, we often have fewer than $n$ samples with which to perform the out-of-bag estimation. In computing the proportion of incorrect classifications by the ensemble, one should therefore divide not by $n$, as in (12), but rather by $n$ minus the number of sample points whose out-of-bag ensembles are empty, which leads to the following modified out-of-bag estimator:
(13)

As shown by the numerical results in Section 4, this estimator has approximately the bias of leave-one-out; that is, it is only slightly pessimistically biased. As far as we know, this formulation of the out-of-bag estimator has not been explicitly given in the literature.
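The normalization argument above can be made concrete with a short sketch. It follows the prose description only (each point is classified by a majority vote among the component classifiers whose bootstrap sets do not contain it, and points with an empty out-of-bag ensemble are excluded from the denominator); it is not a transcription of equations (11) to (13). The function name, the default number of components, and the tie-breaking toward class 0 are assumptions of the sketch.

```python
import random


def out_of_bag_error(sample, rule, m=25, rng=None):
    """Return (modified, plain) out-of-bag estimates.

    modified: errors divided by the number of points that have a non-empty
              out-of-bag ensemble.
    plain:    the same error count divided by n.
    `sample` is a list of (features, label) pairs with labels 0 or 1.
    """
    rng = rng or random.Random(0)
    n = len(sample)
    votes = [[0, 0] for _ in range(n)]  # votes[i][c]: out-of-bag votes for class c
    for _ in range(m):
        idx = [rng.randrange(n) for _ in range(n)]
        in_bag = set(idx)
        clf = rule([sample[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # this classifier never saw point i
                votes[i][clf(sample[i][0])] += 1

    errors = usable = 0
    for (_, y), v in zip(sample, votes):
        if v[0] + v[1] == 0:
            continue  # empty out-of-bag ensemble: point cannot be tested
        usable += 1
        predicted = 1 if v[1] > v[0] else 0  # ties go to class 0 (assumption)
        errors += predicted != y

    modified = errors / usable if usable else float("nan")
    return modified, errors / n


if __name__ == "__main__":
    always_zero = lambda s: (lambda x: 0)  # degenerate rule, for illustration only
    data = [((float(i),), i % 2) for i in range(10)]
    print(out_of_bag_error(data, always_zero))
```

The two returned values differ only when some point appears in every bootstrap set, which becomes more likely as the number of components shrinks; in that case dividing by n counts untestable points as if they had been classified correctly.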

4. Simulation Study

This section reports the results of an extensive simulation study, which was conducted on synthetic data as well as on publicly available microarray and protein abundance mass spectrometry data. We present here selected representative results; the full set of results can be found on the companion website, at http://gsp.tamu.edu/Publications/supplementary/oob. We simulated bagged ensembles of linear discriminant analysis (LDA), 3-nearest neighbors (3NN), and decision trees (CART) [24], and computed actual and estimated errors according to the different estimation methods. The estimators were evaluated based on the distribution of their deviation from the true error, and in terms of bias, variance, and root mean square (RMS) error.

4.1. Methods

We compared the performance of the estimators for varying numbers of training samples and different dimensions of the feature space. The dimensionality and number of samples are selected to be compatible with a small-sample scenario (in this paper, the dimensionality is kept fixed at $p = 2$). For patient data, a small number of features (once again, $p = 2$ in this paper) are first selected by the $t$-test. We then randomly draw a number of samples to be used as the training set and employ the rest as a testing set. The number of training points is kept small to preserve the small-sample setting and to leave a large enough testing set. This procedure was repeated multiple times to obtain the empirical deviation distribution [18], that is, the distribution of estimated minus actual errors, for the different error estimators. The results are presented in the form of beta-fit curves, box plots, and bias, variance, and RMS curves, in order to provide as detailed a picture as possible of the empirical deviation distributions of the error estimators.
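The summary statistics of the empirical deviation distribution can be computed as in the sketch below. It assumes paired lists of estimated and actual errors, one pair per repetition, and uses the population form of the variance so that the squared RMS equals the squared bias plus the variance exactly; the function name is hypothetical.

```python
import math


def deviation_summary(estimated, actual):
    """Bias, standard deviation, and RMS of the deviation
    (estimated error minus actual error) over repeated experiments."""
    deviations = [e - a for e, a in zip(estimated, actual)]
    r = len(deviations)
    bias = sum(deviations) / r
    variance = sum((d - bias) ** 2 for d in deviations) / r
    rms = math.sqrt(bias ** 2 + variance)
    return bias, math.sqrt(variance), rms


if __name__ == "__main__":
    est = [0.10, 0.20, 0.30]
    act = [0.20, 0.20, 0.20]
    print(deviation_summary(est, act))
```

A negative bias indicates an optimistic estimator and a positive bias a pessimistic one; the RMS combines bias and variance into a single accuracy figure.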

4.2. Simulation Based on Synthetic Data

We employ here the spherical Gaussian model, where the covariance matrix is the identity and the two mean vectors are symmetric about the origin. Under this assumption, we varied the Bayes error of the model by changing the distance between the two means. Models with different Bayes errors are compared over varying numbers of samples. The feature-label distribution is known, which allows us to compute exactly the true error of the designed classifier; this is then used to derive the empirical deviation distribution for the different estimators.
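The link between the distance between the means and the Bayes error can be illustrated as follows. The sketch assumes equal class prior probabilities, which the text does not state explicitly, together with identity covariance; under those assumptions the Bayes error is the standard normal CDF evaluated at minus half the distance, and the true error of any linear classifier has a closed form. All function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(delta):
    """Bayes error for two spherical unit-variance Gaussians whose means are
    a distance delta apart, assuming equal priors."""
    return normal_cdf(-delta / 2.0)


def distance_for_bayes_error(target, lo=0.0, hi=20.0, iters=100):
    """Bisection for the mean separation that yields a target Bayes error."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bayes_error(mid) > target:
            lo = mid  # error still too large: means must move further apart
        else:
            hi = mid
    return (lo + hi) / 2.0


def linear_classifier_error(w, b, mu0, mu1):
    """Exact error of the classifier 'label 1 if w.x + b > 0' under the same
    model (identity covariance, equal priors)."""
    norm = math.sqrt(sum(wk * wk for wk in w))
    g0 = sum(wk * mk for wk, mk in zip(w, mu0)) + b
    g1 = sum(wk * mk for wk, mk in zip(w, mu1)) + b
    err0 = normal_cdf(g0 / norm)   # class 0 points falling on the class 1 side
    err1 = normal_cdf(-g1 / norm)  # class 1 points falling on the class 0 side
    return 0.5 * (err0 + err1)


if __name__ == "__main__":
    d = distance_for_bayes_error(0.05)
    print(d, bayes_error(d))
    print(linear_classifier_error((1.0, 0.0), 0.0, (-d / 2, 0.0), (d / 2, 0.0)))
```

For a designed linear classifier such as LDA, `linear_classifier_error` gives the actual error without the need for a test set; nonlinear rules such as 3NN and CART would instead need numerical integration or a large independent test sample.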

4.3. Simulation Based on Patient Data

We utilized the following publicly available data sets from published studies in order to study the performance of bagging in the context of genomics and proteomics applications.

4.3.1. Breast Cancer Gene Expression Data

These data come from the breast cancer classification study in [29], which analyzed gene-expression microarrays containing a total of 25760 transcripts each. Filter-based feature selection was performed on a 70-gene prognosis profile, previously published by the same authors in [30]. Classification is between the good-prognosis class (115 samples), and the poor-prognosis class (180 samples), where prognosis is determined retrospectively in terms of survivability [29].

4.3.2. Lung Cancer Gene Expression Data

We employed here the data set "A" from the study in [31] on non-small cell lung carcinomas (NSCLC), which analyzed gene expression microarrays containing a total of 12600 transcripts each. NSCLC is subclassified into adenocarcinomas, squamous cell carcinomas, and large-cell carcinomas, of which adenocarcinomas are the most common subtype and are of interest to distinguish from the other subtypes of NSCLC. Classification is thus between adenocarcinomas (139 samples) and nonadenocarcinomas (47 samples).

4.3.3. Prostate Cancer Mass Spectrometry Data

Given the recent keen interest in deriving serum-based proteomic biomarkers for the diagnosis of cancer [32], we also included in this study data from a proteomic study of prostate cancer reported in [33]. It consists of SELDI-TOF mass spectrometry of samples, which yields mass spectra for 45000 m/z (mass-to-charge) values. Filter-based feature selection is employed to find the top discriminatory m/z values to be used in the experiment. Classification is between prostate cancer patients (167 samples) and patients without prostate cancer, including benign prostatic hyperplasia and healthy patients (159 samples). We use the raw spectra values, without baseline subtraction, as we found that this leads to better classification rates.

4.4. Results and Discussion

4.4.1. Synthetic Data

The various error estimators can be divided into four groups according to performance. The first group corresponds to resubstitution, which proved to be optimistically biased for the bagged LDA, 3NN, and CART classifiers, with a root mean square error that increases substantially with increasing Bayes error; resubstitution was previously known to behave in this way for single LDA, 3NN, and CART classifiers. The second group contains leave-one-out, fivefold cross-validation, and out-of-bag. As we can see from Figure 1, the out-of-bag estimator, with the formulation given in (13), is almost identical to leave-one-out. This second group shows very small bias but considerable variance. The resemblance of out-of-bag to cross-validation, which had already been pointed out in [22], is explained by the similar way in which the sample set is partitioned. This group shows much smaller bias than resubstitution, consistently so as the Bayes error increases. However, it displays larger variability than resubstitution and the bootstrap group, as was already known from [19] for single classification rules. The third group includes the basic bootstrap, bootstrap 632, and bootstrap 632 plus; this group displays very competitive performance in terms of root mean square error. Even though they often perform better than the two previous groups, the estimators in this group took the longest time to compute across all experiments. The last group consists of the bolstered and semibolstered error estimators, which exhibit superior performance to the other groups in terms of RMS error, despite being far less computationally expensive than the cross-validation and bootstrap estimators.
Figure 1

Comparison of out-of-bag and leave-one-out for different Gaussian models over the number of samples, for dimensionality p = 2. (a) Sample mean. (b) Sample standard deviation.

Generally, for a fixed model, almost all the estimators perform better as the sample size increases, and this holds for all three bagged classifiers. In Figure 2, we see a consistent trend: as the Bayes error increases or, equivalently, as the classification problem becomes harder, error estimation performance degrades steadily in terms of root mean square error; this is true for all error estimation methods. Bolstered error estimators showed consistently superior performance to the others, in terms of both accuracy (RMS) and computational cost. These conclusions are also supported by Figures 3 and 4.
Figure 2

Bias, variance (standard deviation), and RMS of the error estimators as a function of the Bayes error, for the synthetic data, sample size n = 20, and dimensionality p = 2, with different base classification rules.

Figure 3

Empirical deviation distribution (a), box plots (b), and RMS as a function of sample size (c), for synthetic Gaussian model with Bayes error = 0.05, sample size n = 20, and dimensionality p = 2, with different base classification rules.

Figure 4

Empirical deviation distribution (a), box plots (b), and RMS as a function of sample size (c), for synthetic Gaussian model with Bayes error = 0.15, sample size n = 20, and dimensionality p = 2, with different base classification rules.

We observed that the performance of error estimators other than out-of-bag (which can only be applied to ensemble rules) was consistent with their performance with the corresponding single classifier, as reported in other studies [18, 27].

4.4.2. Patient Data

The results for the real patient data sets were entirely consistent with those for the synthetic data, as can be seen in Figures 5, 6, and 7 and Tables 1, 2, and 3. We again observed the division of the error estimators in the same four groups according to performance. We also observed that the bolstered error estimator group displayed the best performance, as measured by RMS.
Table 1

Bias, variance (standard deviation), and RMS for different error estimators, with different base classification rules, for breast cancer gene expression data, and dimensionality p = 2.

Rule  n   stat  resb    boot    bresb   loo     b632    oob     sbresb  b632plus  cv5
lda   20  bias  0.0388  0.0287  0.0104  0.0063  0.0039  0.0076  0.0244  0.0092    0.0143
lda   20  sd    0.0908  0.0944  0.0789  0.1004  0.0912  0.1003  0.0933  0.0938    0.1140
lda   20  rms   0.0988  0.0986  0.0795  0.1006  0.0913  0.1006  0.0964  0.0942    0.1149
lda   40  bias  0.0198  0.0082  0.0084  0.0012  0.0021  0.0002  0.0168  0.0011    0.0044
lda   40  sd    0.0657  0.0642  0.0614  0.0671  0.0638  0.0673  0.0676  0.0641    0.0714
lda   40  rms   0.0686  0.0647  0.0620  0.0671  0.0639  0.0673  0.0696  0.0641    0.0716
lda   60  bias  0.0157  0.0000  0.0097  0.0045  0.0058  0.0036  0.0104  0.0054    0.0011
lda   60  sd    0.0577  0.0559  0.0544  0.0580  0.0560  0.0581  0.0586  0.0560    0.0586
lda   60  rms   0.0598  0.0559  0.0553  0.0582  0.0563  0.0582  0.0595  0.0563    0.0587
cart  20  bias  0.1554  0.0456  0.0330  0.0226  0.0284  0.0267  0.0225  0.0096    0.0094
cart  20  sd    0.0653  0.1047  0.0671  0.1210  0.0798  0.1229  0.0700  0.1059    0.1187
cart  20  rms   0.1686  0.1142  0.0747  0.1231  0.0847  0.1258  0.0735  0.1063    0.1190
cart  40  bias  0.1583  0.0323  0.0358  0.0095  0.0378  0.0143  0.0284  0.0094    0.0058
cart  40  sd    0.0484  0.0697  0.0502  0.0774  0.0533  0.0799  0.0516  0.0671    0.0810
cart  40  rms   0.1655  0.0769  0.0616  0.0780  0.0653  0.0812  0.0589  0.0677    0.0812
cart  60  bias  0.1722  0.0211  0.0377  0.0001  0.0501  0.0043  0.0317  0.0232    0.0050
cart  60  sd    0.0400  0.0624  0.0473  0.0705  0.0473  0.0701  0.0472  0.0590    0.0695
cart  60  rms   0.1768  0.0658  0.0605  0.0705  0.0689  0.0703  0.0569  0.0634    0.0697
3nn   20  bias  0.0964  0.0575  0.0478  0.0270  0.0009  0.0269  0.0176  0.0273    0.0076
3nn   20  sd    0.0716  0.0996  0.0649  0.1174  0.0835  0.1167  0.0778  0.1005    0.1156
3nn   20  rms   0.1201  0.1150  0.0806  0.1204  0.0835  0.1197  0.0798  0.1041    0.1159
3nn   40  bias  0.0952  0.0406  0.0481  0.0109  0.0094  0.0139  0.0214  0.0075    0.0036
3nn   40  sd    0.0529  0.0687  0.0493  0.0787  0.0590  0.0785  0.0577  0.0669    0.0801
3nn   40  rms   0.1089  0.0798  0.0689  0.0794  0.0598  0.0797  0.0615  0.0673    0.0802
3nn   60  bias  0.0962  0.0316  0.0504  0.0034  0.0154  0.0054  0.0261  0.0012    0.0008
3nn   60  sd    0.0432  0.0625  0.0452  0.0693  0.0526  0.0693  0.0514  0.0595    0.0680
3nn   60  rms   0.1054  0.0701  0.0677  0.0694  0.0548  0.0695  0.0576  0.0595    0.0680

Table 2

Bias, variance (standard deviation), and RMS for different error estimators, with different base classification rules, for lung cancer gene expression data, and dimensionality p = 2.

Rule  n   stat  resb    boot    bresb   loo     b632    oob     sbresb  b632plus  cv5
lda   20  bias  0.0243  0.0238  0.0070  0.0075  0.0061  0.0103  0.0294  0.0094    0.0106
lda   20  sd    0.0938  0.0938  0.0827  0.0989  0.0923  0.0988  0.0910  0.0932    0.1025
lda   20  rms   0.0969  0.0967  0.0830  0.0992  0.0925  0.0993  0.0956  0.0937    0.1031
lda   40  bias  0.0118  0.0109  0.0012  0.0017  0.0025  0.0044  0.0273  0.0033    0.0045
lda   40  sd    0.0675  0.0655  0.0628  0.0684  0.0656  0.0685  0.0652  0.0656    0.0694
lda   40  rms   0.0685  0.0664  0.0628  0.0684  0.0657  0.0686  0.0707  0.0657    0.0695
lda   60  bias  0.0092  0.0067  0.0023  0.0004  0.0009  0.0015  0.0235  0.0012    0.0020
lda   60  sd    0.0606  0.0587  0.0570  0.0608  0.0590  0.0608  0.0586  0.0590    0.0610
lda   60  rms   0.0613  0.0591  0.0570  0.0608  0.0591  0.0609  0.0632  0.0590    0.0610
cart  20  bias  0.0945  0.0321  0.0025  0.0100  0.0145  0.0139  0.0076  0.0031    0.0017
cart  20  sd    0.0502  0.0852  0.0623  0.0916  0.0683  0.0945  0.0676  0.0811    0.0849
cart  20  rms   0.1069  0.0911  0.0623  0.0921  0.0699  0.0955  0.0681  0.0812    0.0849
cart  40  bias  0.0926  0.0226  0.0230  0.0071  0.0198  0.0088  0.0141  0.0071    0.0022
cart  40  sd    0.0384  0.0630  0.0439  0.0694  0.0504  0.0705  0.0472  0.0577    0.0654
cart  40  rms   0.1003  0.0670  0.0496  0.0698  0.0542  0.0710  0.0493  0.0581    0.0655
cart  60  bias  0.0938  0.0202  0.0277  0.0043  0.0218  0.0068  0.0210  0.0103    0.0012
cart  60  sd    0.0335  0.0544  0.0397  0.0590  0.0438  0.0597  0.0414  0.0496    0.0571
cart  60  rms   0.0996  0.0580  0.0484  0.0592  0.0490  0.0601  0.0464  0.0507    0.0571
3nn   20  bias  0.0483  0.0474  0.0185  0.0114  0.0122  0.0132  0.0027  0.0238    0.0040
3nn   20  sd    0.0552  0.0803  0.0529  0.0876  0.0677  0.0870  0.0623  0.0765    0.0787
3nn   20  rms   0.0734  0.0932  0.0561  0.0884  0.0688  0.0880  0.0624  0.0802    0.0788
3nn   40  bias  0.0489  0.0236  0.0270  0.0043  0.0031  0.0055  0.0094  0.0027    0.0004
3nn   40  sd    0.0435  0.0602  0.0411  0.0626  0.0519  0.0624  0.0484  0.0555    0.0593
3nn   40  rms   0.0655  0.0646  0.0492  0.0627  0.0520  0.0626  0.0493  0.0555    0.0593
3nn   60  bias  0.0500  0.0198  0.0317  0.0031  0.0059  0.0036  0.0147  0.0009    0.0028
3nn   60  sd    0.0381  0.0526  0.0383  0.0555  0.0459  0.0553  0.0439  0.0486    0.0514
3nn   60  rms   0.0629  0.0562  0.0497  0.0556  0.0462  0.0555  0.0463  0.0486    0.0514

Table 3

Bias, variance (standard deviation), and RMS for different error estimators, with different base classification rules, for prostate cancer mass-spectrometry data, and dimensionality p = 2.

Rule  n   stat  resb    boot    bresb   loo     b632    oob     sbresb  b632plus  cv5
lda   20  bias  0.0506  0.0181  0.0277  0.0033  0.0072  0.0044  0.0050  0.0019    0.0006
lda   20  sd    0.0871  0.1025  0.0879  0.1031  0.0949  0.1037  0.0993  0.0985    0.1071
lda   20  rms   0.1007  0.1041  0.0921  0.1031  0.0951  0.1038  0.0994  0.0985    0.1071
lda   40  bias  0.0283  0.0079  0.0189  0.0051  0.0054  0.0042  0.0029  0.0039    0.0031
lda   40  sd    0.0609  0.0688  0.0626  0.0673  0.0647  0.0683  0.0674  0.0655    0.0693
lda   40  rms   0.0672  0.0693  0.0654  0.0675  0.0649  0.0684  0.0675  0.0656    0.0694
lda   60  bias  0.0192  0.0045  0.0141  0.0042  0.0042  0.0044  0.0008  0.0035    0.0017
lda   60  sd    0.0514  0.0572  0.0524  0.0542  0.0542  0.0549  0.0559  0.0546    0.0577
lda   60  rms   0.0549  0.0573  0.0542  0.0544  0.0544  0.0550  0.0560  0.0547    0.0577
cart  20  bias  0.1504  0.0409  0.0500  0.0164  0.0295  0.0248  0.0441  0.0014    0.0059
cart  20  sd    0.0693  0.1082  0.0765  0.1198  0.0847  0.1223  0.0791  0.1053    0.1169
cart  20  rms   0.1655  0.1157  0.0914  0.1209  0.0897  0.1247  0.0905  0.1054    0.1170
cart  40  bias  0.1412  0.0320  0.0436  0.0047  0.0317  0.0096  0.0418  0.0108    0.0044
cart  40  sd    0.0461  0.0701  0.0497  0.0753  0.0539  0.0773  0.0503  0.0646    0.0787
cart  40  rms   0.1485  0.0771  0.0661  0.0755  0.0625  0.0779  0.0654  0.0655    0.0788
cart  60  bias  0.1397  0.0284  0.0404  0.0021  0.0334  0.0088  0.0393  0.0155    0.0049
cart  60  sd    0.0347  0.0580  0.0418  0.0626  0.0441  0.0648  0.0424  0.0521    0.0636
cart  60  rms   0.1439  0.0646  0.0581  0.0627  0.0554  0.0654  0.0578  0.0544    0.0637
3nn   20  bias  0.0820  0.0554  0.0488  0.0165  0.0048  0.0200  0.0371  0.0233    0.0104
3nn   20  sd    0.0748  0.1041  0.0757  0.1100  0.0871  0.1129  0.0805  0.0993    0.1037
3nn   20  rms   0.1110  0.1179  0.0901  0.1112  0.0872  0.1147  0.0886  0.1020    0.1043
3nn   40  bias  0.0673  0.0405  0.0377  0.0029  0.0008  0.0067  0.0271  0.0099    0.0040
3nn   40  sd    0.0458  0.0643  0.0460  0.0679  0.0536  0.0695  0.0504  0.0585    0.0644
3nn   40  rms   0.0814  0.0760  0.0595  0.0680  0.0536  0.0698  0.0572  0.0593    0.0645
3nn   60  bias  0.0660  0.0304  0.0375  0.0015  0.0051  0.0040  0.0269  0.0016    0.0006
3nn   60  sd    0.0389  0.0534  0.0393  0.0560  0.0451  0.0563  0.0435  0.0482    0.0557
3nn   60  rms   0.0766  0.0614  0.0543  0.0560  0.0454  0.0564  0.0511  0.0482    0.0557

Figure 5

Empirical deviation distribution (a) and box plots (b), for breast cancer gene expression data, sample size n = 20, and dimensionality p = 2, with different base classification rules.

Figure 6

Empirical deviation distribution (a) and box plots (b), for lung cancer gene expression data, sample size n = 20, and dimensionality p = 2, with different base classification rules.

Figure 7

Empirical deviation distribution (a) and box plots (b), for prostate cancer mass-spectrometry data, sample size n = 20, and dimensionality p = 2, with different base classification rules.

5. Conclusion

We presented an extensive study of several error estimation methods for bagged ensembles of typical classifiers. We provided here an explicit formulation for the out-of-bag error estimator, which is intended to remove estimator bias. We observed that this out-of-bag error estimator was almost identical to leave-one-out, under spherical Gaussian models, and conjectured a very close relationship between the two. The results of our simulation study were consistent between synthetic and real patient data, and the performance of error estimators that can be applied to single classifiers (i.e., all of them save for the out-of-bag estimator) with the bagged classifiers was comparable to their performance with the corresponding single classifier, as reported elsewhere. The bolstered error estimators exhibited the best performance, in terms of RMS error, in our simulation study, despite being far less computationally expensive than cross-validation and bootstrap estimators. We hope this work will provide useful guidance to practitioners working with bagged ensemble classifiers designed on small-sample data.

Declarations

Acknowledgment

This work was supported by the National Science Foundation, through NSF Award CCF-0845407.

Authors’ Affiliations

(1) Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843-3128, USA

References

  1. Schapire RE: The strength of weak learnability. Machine Learning 1990, 5(2):197-227.
  2. Freund Y: Boosting a weak learning algorithm by majority. Proceedings of the 3rd Annual Workshop on Computational Learning Theory, 1990, 202-216.
  3. Xu L, Krzyzak A, Suen C: Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man and Cybernetics 1992, 22(3):418-435. 10.1109/21.155943
  4. Breiman L: Bagging predictors. Machine Learning 1996, 24(2):123-140.
  5. Breiman L: Random forests. Machine Learning 2001, 45(1):5-32. 10.1023/A:1010933404324
  6. Efron B: Bootstrap methods: another look at the jackknife. Annals of Statistics 1979, 7: 1-26. 10.1214/aos/1176344552
  7. Alvarez S, Diaz-Uriarte R, Osorio A, Barroso A, Melchor L, Paz MF, Honrado E, Rodríguez R, Urioste M, Valle L, Díez O, Cigudosa JC, Dopazo J, Esteller M, Benitez J: A predictor based on the somatic genomic changes of the BRCA1/BRCA2 breast cancer tumors identifies the non-BRCA1/BRCA2 tumors with BRCA1 promoter hypermethylation. Clinical Cancer Research 2005, 11(3):1146-1153.
  8. Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP: Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(16):9608-9613. 10.1073/pnas.1632587100
  9. Díaz-Uriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7, article 3.
  10. Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008, 9, article 319.
  11. Izmirlian G: Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Annals of the New York Academy of Sciences 2004, 1020: 154-174. 10.1196/annals.1310.015
  12. Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003, 19(13):1636-1643. 10.1093/bioinformatics/btg210
  13. Geurts P, Fillet M, de Seny D, Meuwis M-A, Malaise M, Merville M-P, Wehenkel L: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 2005, 21(14):3138-3145. 10.1093/bioinformatics/bti494
  14. Zhang B, Pham TD, Zhang Y: Bagging support vector machine for classification of SELDI-ToF mass spectra of ovarian cancer serum samples. Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI '07), December 2007, Gold Coast, Australia, Lecture Notes in Computer Science 4830: 820-826.
  15. Assareh A, Moradi MH, Esmaeili V: A novel ensemble strategy for classification of prostate cancer protein mass spectra. Proceedings of the 29th Annual International Conference of IEEE-EMBS, Engineering in Medicine and Biology Society (EMBC '07), August 2007, 5987-5990.
  16. Tong W, Xie Q, Hong H, Fang H, Shi L, Perkins R, Petricoin EF: Using decision forest to classify prostate cancer samples on the basis of SELDI-TOF MS data: assessing chance correlation and prediction confidence. Environmental Health Perspectives 2004, 112(16):1622-1627. 10.1289/ehp.7109
  17. Vu TT, Braga-Neto UM: Is bagging effective in the classification of small-sample genomic and proteomic data? EURASIP Journal on Bioinformatics and Systems Biology 2009, 2009.
  18. Braga-Neto UM, Dougherty ER: Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004, 20(3):374-380. 10.1093/bioinformatics/btg419
  19. Braga-Neto U, Hashimoto R, Dougherty ER, Nguyen DV, Carroll RJ: Is cross-validation better than resubstitution for ranking genes? Bioinformatics 2004, 20(2):253-258. 10.1093/bioinformatics/btg399
  20. Braga-Neto U, Dougherty E: Exact performance of error estimators for discrete classifiers. Pattern Recognition 2005, 38(11):1799-1814. 10.1016/j.patcog.2005.02.013
  21. Breiman L: Out-of-bag estimation. Department of Statistics, University of California; ftp://ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps.Z
  22. Bylander T: Estimating generalization error on two-class datasets using out-of-bag estimates. Machine Learning 2002, 48(1–3):287-297.
  23. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP: A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007, 29(1):173-180.
  24. Duda RO, Hart PE, Stork DG: Pattern Classification. 2nd edition. John Wiley & Sons, New York, NY, USA; 2001.
  25. Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 1983, 78(382):316-331. 10.2307/2288636
  26. Efron B, Tibshirani R: Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 1997, 92(438):548-560. 10.2307/2965703
  27. Braga-Neto U, Dougherty E: Bolstered error estimation. Pattern Recognition 2004, 37(6):1267-1281. 10.1016/j.patcog.2003.08.017
  28. Martínez-Muñoz G, Suárez A: Out-of-bag estimation of the optimal sample size in bagging. Pattern Recognition 2010, 43(1):143-152. 10.1016/j.patcog.2009.05.010
  29. van de Vijver MJ, He YD, van 'T Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, Van Der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R: A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine 2002, 347(25):1999-2009. 10.1056/NEJMoa021967
  30. Van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, Van Der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530-536. 10.1038/415530a
  31. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(24):13790-13795. 10.1073/pnas.191502998
  32. Issaq HJ, Veenstra TD, Conrads TP, Felschow D: The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification. Biochemical and Biophysical Research Communications 2002, 292(3):587-592. 10.1006/bbrc.2002.6678
  33. Adam B-L, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z, Wright GL Jr.: Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Research 2002, 62(13):3609-3614.

Copyright

© T. T. Vu and U. M. Braga-Neto. 2010

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
