Exact Performance of CoD Estimators in Discrete Prediction
© T. Chen and U. Braga-Neto. 2010
Received: 1 April 2010
Accepted: 9 July 2010
Published: 27 July 2010
The coefficient of determination (CoD) has significant applications in genomics, for example, in the inference of gene regulatory networks. We study several CoD estimators, based upon the resubstitution, leave-one-out, cross-validation, and bootstrap error estimators. We present an exact formulation of performance metrics for the resubstitution and leave-one-out CoD estimators, assuming the discrete histogram rule. Numerical experiments are carried out using a parametric Zipf model, where we compute exact performance metrics of the resubstitution and leave-one-out CoD estimators using the derived equations, for varying actual CoD, sample size, and bin size. These results are compared to approximate performance metrics of the 10-repeated 2-fold cross-validation and 0.632 bootstrap CoD estimators, computed via Monte Carlo sampling. The numerical results lead to a perhaps surprising conclusion: under the Zipf model considered, and for moderate and large values of the actual CoD, the resubstitution CoD estimator is the least biased and least variable among all CoD estimators, especially for a small number of predictors. We also observed that the leave-one-out and cross-validation CoD estimators tend to perform the worst, whereas the performance of the bootstrap CoD estimator is intermediate, despite its high computational complexity.
1. Introduction
The coefficient of determination (CoD) for binary prediction is defined as
$$\mathrm{CoD} = \frac{\varepsilon_0 - \varepsilon}{\varepsilon_0}, \tag{1}$$
where $\varepsilon_0$ is the error of the best predictor of the target $Y$ in the absence of other observations and $\varepsilon$ is the error of the best predictor of $Y$ based on the observation of the predictors $X$. The binary CoD measures the relative decrease in prediction error when using predictor variables to estimate the target variable, as opposed to using no predictor variables. The closer it is to one, the tighter the regulation of the target variable by the predictor variables is, whereas the closer it is to zero, the looser the regulation is. The CoD will correctly produce low values in cases where the no-predictor error is already small, or when adding predictors does not contribute to a significant decrease in error. The CoD is a function only of the joint distribution between predictors and target; thus it characterizes the regulatory relationship among them.
The concept of CoD has far-reaching applications in genomics. The CoD was perhaps the first predictive paradigm utilized in the context of microarray data, the goal being to provide a measure of nonlinear interaction among genes [1–6]. In [2, 4, 6], the CoD is applied to discrete prediction problems in which gene expression is quantized into discrete levels. In [3, 5], the CoD is applied to the reconstruction or inference of gene regulatory networks. Like its classical counterpart, the binary CoD is a goodness-of-fit statistic that can be used to assess the relationship between predictor and target variables, for example, the associations between gene expression patterns in practical applications. The CoD permits biologists to focus on particular connections in the genome, and the estimated coefficients provide a practical criterion for selecting among potential predictor sets.
The error of the best predictor corresponds to the optimal prediction error, also known as the Bayes error, which depends only on the underlying probability model. However, in practical real-world problems, the underlying probability model is unknown, and thus we arrive at the fundamental issue of how to find a good prediction error estimator in small-sample settings [8, 9]. An error estimator may be a deterministic function of the sample data, in which case it is called a nonrandomized error estimator; such popular error estimators as resubstitution and leave-one-out are examples. These error estimators are random only through the random sample data. Closed-form analytical expressions for performance metrics such as bias, deviation variance, and RMS of resubstitution and leave-one-out error estimators have been given in [9, 10]. By contrast, randomized error estimators, like cross-validation and bootstrap, have "internal" random factors that affect their outcome, and thus approximate approaches, usually via Monte Carlo sampling, are typically used to analyze their performance.
Likewise, the CoD must in practice be estimated from sample data. A CoD estimator is obtained from (1) by using one of the usual error estimators for the prediction error with variables, $\varepsilon$, and the empirical frequency estimator for the prediction error with no variables, $\varepsilon_0$. We may thus speak of non-randomized CoD estimators, including the resubstitution and leave-one-out CoD estimators, and randomized CoD estimators, including the bootstrap and cross-validation CoD estimators. The CoD with the true values of $\varepsilon_0$ and $\varepsilon$ in (1) will be called in this paper the "actual CoD." We will employ the discrete histogram rule [7, 8], the most widely used and intuitive rule for discrete prediction problems, in order to estimate prediction errors and CoDs from the sample data.
This paper presents, for the first time, an exact formulation for performance metrics of the resubstitution and leave-one-out CoD estimators, for the discrete histogram rule. Numerical experiments are carried out using a parametric Zipf model, where we compute the exact performance of the resubstitution and leave-one-out CoD estimators using the derived equations, for varying actual CoD, sample size, and bin size. We compare these results to approximate performance metrics of randomized CoD estimators (bootstrap and cross-validation), computed via Monte Carlo sampling. The numerical results indicate that, under the Zipf model considered, and for moderate and large values of the actual CoD, the resubstitution CoD estimator is the least biased and least variable among all CoD estimators, especially for a small number of predictors. In fact, with two predictors, the resubstitution CoD estimator nearly dominates all other estimators uniformly across all values of the actual CoD. The leave-one-out and cross-validation CoD estimators tend to perform the worst, whereas the performance of the bootstrap CoD estimator is intermediate, despite its high computational complexity. This indicates that, provided one has evidence of moderate to tight regulation between the genes and the number of predictors is not too large, one should use the CoD estimator based on resubstitution.
This paper is organized as follows. In Section 2, the probability model used in discrete prediction is introduced. In Section 3, the discrete histogram rule is recalled, and formal definitions are given for the actual CoD and several CoD estimators, including two non-randomized CoD estimators (i.e., resubstitution and leave-one-out) and two randomized CoD estimators (i.e., 0.632 bootstrap and 10-repeated 2-fold cross-validation). Section 4 introduces performance metrics (i.e., bias, deviation variance, and RMS) of a CoD estimator, and Section 5 presents an analytical formulation of exact performance metrics of the resubstitution and leave-one-out CoD estimators. In Section 6, we present numerical results, based on a parametric Zipf model, that compare the performance metrics of all the CoD estimators considered in this paper. Finally, Section 7 presents concluding remarks.
2. Discrete Prediction
Let $X_1, X_2, \ldots, X_d$ be predictor random variables, such that each takes on a finite number of values, and let $Y \in \{0, 1\}$ be the target random variable, for the discrete prediction problem. The predictors as a group can take on values in a finite space with $b$ possible states. For analysis purposes, we establish a bijection between this finite state space and a single predictor variable $X$ taking values in the set $\{1, 2, \ldots, b\}$. The variable $X$ has a one-to-one relationship with the finite state space coded by $X_1, \ldots, X_d$: one specific value of $X$ represents a specific combination of the values of the original predictors, that is, a "bin" into which the data is categorized. The value $b$ is the number of bins, which provides a direct measure of predictor complexity.
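The bijection between predictor combinations and bins can be sketched in a few lines. The mixed-radix coding below is only one possible choice of bijection, assumed here for illustration; the function names `bin_index` and `num_bins` are hypothetical and do not come from the paper.

```python
def num_bins(levels):
    """Number of bins b for predictors with the given numbers of levels."""
    b = 1
    for k in levels:
        b *= k
    return b


def bin_index(values, levels):
    """Map a tuple of predictor values to a bin number in 1..b.

    values[j] must lie in 0..levels[j]-1. A mixed-radix code is used,
    which is one of many valid bijections (an assumption of this sketch).
    """
    if len(values) != len(levels):
        raise ValueError("values and levels must have the same length")
    index = 0
    for v, k in zip(values, levels):
        if not 0 <= v < k:
            raise ValueError("predictor value out of range")
        index = index * k + v
    return index + 1  # bins are numbered from 1


# Three binary predictors give b = 8 bins.
levels = (2, 2, 2)
print(num_bins(levels), bin_index((1, 0, 1), levels))
```

Any other one-to-one coding would serve equally well, since only the number of bins matters for the analysis.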
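Since the optimal error with predictors cannot exceed the optimal error without them, that is, $0 \le \varepsilon \le \varepsilon_0$, we have that $0 \le \mathrm{CoD} \le 1$. We have $\mathrm{CoD} = 1$ if and only if $\varepsilon = 0$, that is, there is perfect regulation between predictors and target. On the other hand, $\mathrm{CoD} = 0$ if and only if $\varepsilon = \varepsilon_0$, that is, the predictors exert no regulation on the target.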
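As an illustration, the actual CoD can be computed directly from a fully specified discrete model. The sketch below assumes a binary target with prior `c0` for class 0 and class-conditional bin probabilities `p` (class 0) and `q` (class 1); these names are assumptions of this example. It uses the standard expressions for the optimal errors, namely the smaller prior for the no-predictor case and the sum over bins of the smaller joint probability for the case with predictors.

```python
def actual_cod(c0, p, q):
    """Actual CoD of a discrete model with a binary target.

    c0 : prior probability of class 0 (class 1 has prior 1 - c0)
    p  : bin probabilities given class 0
    q  : bin probabilities given class 1
    """
    c1 = 1.0 - c0
    eps0 = min(c0, c1)  # optimal error with no predictors
    if eps0 <= 0.0:
        raise ValueError("both classes must have positive prior probability")
    # Optimal (Bayes) error with predictors: in each bin, the less likely
    # class is misclassified.
    eps = sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))
    return (eps0 - eps) / eps0


# Identical class-conditional distributions: no regulation.
print(actual_cod(0.5, [0.5, 0.5], [0.5, 0.5]))
# Disjoint supports: perfect regulation.
print(actual_cod(0.5, [1.0, 0.0], [0.0, 1.0]))
```

The two calls reproduce the two boundary cases discussed above.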
3. CoD Estimation
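The empirical frequency estimator of $\varepsilon_0$ is
$$\hat{\varepsilon}_0 = \frac{\min\{N_0, N_1\}}{n}, \tag{8}$$
where $N_0$ and $N_1$ are random variables corresponding to the number of sample points belonging to classes $Y = 0$ and $Y = 1$, respectively, and $n = N_0 + N_1$ is the sample size. We assume throughout that $N_0, N_1 > 0$, that is, each class is represented by at least one sample. Note that $\hat{\varepsilon}_0$ has the desirable property of being a universally consistent estimator of $\varepsilon_0$ in (5), that is, $\hat{\varepsilon}_0 \to \varepsilon_0$ in probability (in fact, almost surely) as $n \to \infty$, regardless of the probability model.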
We introduce next each of the CoD estimators considered in this paper.
3.1. Resubstitution CoD Estimator
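A minimal sketch of the resubstitution CoD estimator under the discrete histogram rule follows. It assumes the sample is summarized by per-bin class counts `U` (class 0) and `V` (class 1), names chosen for this example. The histogram rule predicts the majority class in each bin, so the resubstitution error counts the minority points in every bin, and the no-predictor error is estimated by the empirical frequency of the smaller class.

```python
def resub_cod(U, V):
    """Resubstitution CoD estimate from per-bin class counts.

    U[i], V[i] : number of class-0 and class-1 sample points in bin i.
    """
    n0, n1 = sum(U), sum(V)
    if n0 == 0 or n1 == 0:
        raise ValueError("each class needs at least one sample point")
    n = n0 + n1
    eps0_hat = min(n0, n1) / n  # empirical frequency estimator
    # Majority vote in each bin misclassifies the minority points.
    eps_resub = sum(min(u, v) for u, v in zip(U, V)) / n
    return (eps0_hat - eps_resub) / eps0_hat


print(resub_cod([3, 1], [1, 3]))
```

Because the per-bin minority counts can never add up to more than the overall minority count, this estimate always lies between zero and one.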
3.2. Leave-One-Out CoD Estimator
The leave-one-out CoD estimator provides an opportunity to reflect on the uniform choice of the empirical frequency estimator in (8) as an estimator of $\varepsilon_0$, including here. Clearly, the empirical frequency corresponds to the resubstitution estimator of $\varepsilon_0$. The question arises as to whether, for the leave-one-out CoD estimator, the leave-one-out error estimator of $\varepsilon_0$ should be used instead. For $N_0 = N_1 = n/2$, we get $\hat{\varepsilon}_0 = 1/2$ with the choice of the resubstitution estimator (empirical frequency), but $\hat{\varepsilon}_0 = 1$ with the choice of the leave-one-out estimator, since removing any sample point leaves its own class in the minority and the point is therefore always misclassified; this is a useless result. Similar problems beset other estimators of $\varepsilon_0$. Hence, the empirical frequency estimator is employed here as the estimator of $\varepsilon_0$ for all CoD estimators.
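The following sketch illustrates the leave-one-out CoD estimator for the histogram rule, again using per-bin class counts `U` and `V` (hypothetical names). It assumes that ties in a bin are broken in favor of class 0, which is a convention chosen for this example. Applying the same leave-one-out count to a single bin gives the leave-one-out estimate of the no-predictor error, which exposes the degenerate case with balanced classes.

```python
def loo_error(U, V):
    """Leave-one-out error of the histogram rule (ties go to class 0)."""
    n = sum(U) + sum(V)
    errors = 0
    for u, v in zip(U, V):
        # Removing a class-0 point leaves (u - 1, v): it is predicted as
        # class 1, hence misclassified, when u - 1 < v.
        if u <= v:
            errors += u
        # Removing a class-1 point leaves (u, v - 1): it is predicted as
        # class 0, hence misclassified, when u >= v - 1.
        if u >= v - 1:
            errors += v
    return errors / n


def loo_cod(U, V):
    """Leave-one-out CoD estimate, with the empirical frequency for eps0."""
    n0, n1 = sum(U), sum(V)
    eps0_hat = min(n0, n1) / (n0 + n1)
    return (eps0_hat - loo_error(U, V)) / eps0_hat


# Balanced classes and no predictors (one bin): every point is misclassified.
print(loo_error([5], [5]))
print(loo_cod([2, 1], [1, 2]))
```

Note that the leave-one-out CoD estimate can be negative, because the leave-one-out error may exceed the empirical frequency estimate of the no-predictor error.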
3.3. Cross-Validation CoD Estimator
In order to get reasonable variance properties, a large number of repetitions may be required, which can make the cross-validation CoD estimator slow to compute.
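A repeated 2-fold cross-validation CoD estimator can be sketched as follows. The sample is assumed to be a list of `(x, y)` pairs with bin index `x` in `0..b-1` and label `y` in `{0, 1}`; the function names, the 0-based bins, the tie-breaking toward class 0, and the use of Python's `random` module are all assumptions of this example rather than details from the paper.

```python
import random


def histogram_rule(samples, b):
    """Majority vote per bin; ties and empty bins predict class 0."""
    U, V = [0] * b, [0] * b
    for x, y in samples:
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    return [1 if v > u else 0 for u, v in zip(U, V)]


def cv_cod(samples, b, repeats=10, seed=0):
    """CoD estimate based on repeated 2-fold cross-validation."""
    rng = random.Random(seed)
    n = len(samples)
    total_errors = 0
    for _ in range(repeats):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        folds = (shuffled[: n // 2], shuffled[n // 2:])
        for k in (0, 1):
            # Train on one fold, count errors on the other.
            rule = histogram_rule(folds[1 - k], b)
            total_errors += sum(rule[x] != y for x, y in folds[k])
    eps_cv = total_errors / (repeats * n)
    n1 = sum(y for _, y in samples)
    eps0_hat = min(n - n1, n1) / n  # empirical frequency estimator
    return (eps0_hat - eps_cv) / eps0_hat


data = [(0, 0)] * 6 + [(1, 1)] * 6 + [(0, 1), (1, 0)]
print(cv_cod(data, b=2))
```

Each repetition retrains the rule twice, so the cost grows linearly with the number of repetitions.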
3.4. Bootstrap CoD Estimator
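A minimal sketch of a 0.632 bootstrap CoD estimator is given below. It assumes the usual 0.632 weighting of the resubstitution error and the zero bootstrap error, with the latter pooled over all bootstrap samples; the pooling, the number of bootstrap samples `B`, the function names, and the tie-breaking toward class 0 are assumptions of this example. The sample is a list of `(x, y)` pairs with bin index `x` in `0..b-1` and label `y` in `{0, 1}`.

```python
import random


def histogram_rule(samples, b):
    """Majority vote per bin; ties and empty bins predict class 0."""
    U, V = [0] * b, [0] * b
    for x, y in samples:
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    return [1 if v > u else 0 for u, v in zip(U, V)]


def bootstrap632_cod(samples, b, B=100, seed=0):
    """CoD estimate based on the 0.632 bootstrap error estimator."""
    rng = random.Random(seed)
    n = len(samples)
    full_rule = histogram_rule(samples, b)
    eps_resub = sum(full_rule[x] != y for x, y in samples) / n
    errors = tested = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        chosen = set(idx)
        rule = histogram_rule([samples[i] for i in idx], b)
        held_out = [samples[i] for i in range(n) if i not in chosen]
        errors += sum(rule[x] != y for x, y in held_out)
        tested += len(held_out)
    # Zero bootstrap error: error on points left out of the bootstrap sample.
    eps_zero = errors / tested if tested else eps_resub
    eps_632 = 0.368 * eps_resub + 0.632 * eps_zero
    n1 = sum(y for _, y in samples)
    eps0_hat = min(n - n1, n1) / n  # empirical frequency estimator
    return (eps0_hat - eps_632) / eps0_hat


data = [(0, 0)] * 6 + [(1, 1)] * 6 + [(0, 1), (1, 0)]
print(bootstrap632_cod(data, b=2))
```

The rule is retrained once per bootstrap sample, which is the source of the high computational cost of this estimator.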
4. Performance Metrics of CoD Estimators
5. Exact Moments of Nonrandomized CoD Estimators
As mentioned in the Introduction, we can categorize CoD estimators into non-randomized and randomized, depending on whether the prediction error estimator is non-randomized or randomized. Non-randomized CoD estimators, such as the resubstitution and leave-one-out CoD estimators, are deterministic functions of the sample data, which makes an analytical formulation of their performance metrics possible. On the other hand, the performance of randomized CoD estimators, such as the cross-validation and bootstrap CoD estimators, is very difficult to study analytically and is typically investigated via Monte Carlo sampling (which is done in Section 6).
In this section, we will present exact expressions for the computation of the first moment and the second moment for the case of resubstitution and leave-one-out error estimators, which suffice to compute the bias, variance, and RMS of the corresponding CoD estimator, as discussed in the previous section. These expressions are functions only of sample size, number of bins (complexity), and the probability model. We will assume throughout, for definiteness, that the sample size $n$ is even. The case where $n$ is odd is in fact slightly simpler and can be readily obtained in analogous fashion to the derivations presented below.
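To make concrete what exact moments mean here, the sketch below computes the first and second moments of the resubstitution CoD estimator by brute-force enumeration of all possible bin counts. This is not the closed-form formulation referred to above; it is a naive check that is feasible only for very small sample sizes and few bins. The model parameters `c0`, `p`, `q`, the function names, and the conditioning on both classes being present are assumptions of this example.

```python
from math import factorial


def compositions(n, k):
    """Yield all tuples of k nonnegative integers that sum to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def exact_resub_moments(n, c0, p, q):
    """First and second moments of the resubstitution CoD estimate,
    conditioned on each class having at least one sample point."""
    b = len(p)
    cell_probs = [c0 * pi for pi in p] + [(1.0 - c0) * qi for qi in q]
    m1 = m2 = mass = 0.0
    for counts in compositions(n, 2 * b):
        U, V = counts[:b], counts[b:]
        n0, n1 = sum(U), sum(V)
        if n0 == 0 or n1 == 0:
            continue  # excluded by the conditioning assumption
        coef = factorial(n)
        prob = 1.0
        for c, pr in zip(counts, cell_probs):
            coef //= factorial(c)  # stays an integer at every step
            prob *= pr ** c
        prob *= coef  # multinomial probability of this count vector
        cod = 1.0 - sum(min(u, v) for u, v in zip(U, V)) / min(n0, n1)
        m1 += prob * cod
        m2 += prob * cod * cod
        mass += prob
    return m1 / mass, m2 / mass


m1, m2 = exact_resub_moments(6, 0.5, [0.5, 0.5], [0.5, 0.5])
print(m1, m2 - m1 ** 2)  # mean and variance when the actual CoD is zero
```

In this example the actual CoD is zero, so the first moment equals the bias of the estimator.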
6. Numerical Experiments
Assuming a parametric probability model in this section, we plot the exact performance metrics of the resubstitution and leave-one-out CoD estimators, by using the analytical expressions obtained in Sections 4 and 5, under varying actual CoD, sample size, and predictor complexity (number of bins). We also compare these exact performance metrics with the approximate performance metrics for cross-validation and bootstrap CoD estimators computed via Monte Carlo sampling. The Monte Carlo computation was carried out by drawing simulated training data sets of the required sample size from the probability model in each case, and employing sample means and sample variances to approximate the performance metrics in Section 4.
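The Monte Carlo approximation described above can be sketched as follows. The example uses the resubstitution CoD estimator only to keep the code short; the same loop applies to any estimator. The model parameters `c0`, `p`, `q`, the number of repetitions `reps`, the function names, and the rejection of samples with an empty class are assumptions of this sketch. The metrics are taken to be the mean, the variance, and the root mean square of the deviation between the estimated and the actual CoD.

```python
import math
import random


def resub_cod(U, V):
    """Resubstitution CoD estimate from per-bin class counts."""
    n0, n1 = sum(U), sum(V)
    return 1.0 - sum(min(u, v) for u, v in zip(U, V)) / min(n0, n1)


def draw_counts(n, c0, p, q, rng):
    """Draw n sample points and return per-bin class counts (U, V).

    Samples in which one class is absent are redrawn (an assumption).
    """
    b = len(p)
    while True:
        U, V = [0] * b, [0] * b
        for _ in range(n):
            if rng.random() < c0:
                U[rng.choices(range(b), weights=p)[0]] += 1
            else:
                V[rng.choices(range(b), weights=q)[0]] += 1
        if sum(U) > 0 and sum(V) > 0:
            return U, V


def monte_carlo_metrics(estimator, actual, n, c0, p, q, reps=2000, seed=0):
    """Approximate bias, deviation variance, and RMS of a CoD estimator."""
    rng = random.Random(seed)
    devs = [estimator(*draw_counts(n, c0, p, q, rng)) - actual
            for _ in range(reps)]
    bias = sum(devs) / reps
    var = sum((d - bias) ** 2 for d in devs) / reps
    rms = math.sqrt(sum(d * d for d in devs) / reps)
    return bias, var, rms


# Model with no regulation: the actual CoD is zero.
print(monte_carlo_metrics(resub_cod, 0.0, 20, 0.5, [0.5, 0.5], [0.5, 0.5]))
```

With the variance normalized by the number of repetitions, the squared RMS equals the squared bias plus the deviation variance, which is a convenient sanity check on the output.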
For simplicity, we assume that . It can be seen easily from (6) that the CoD increases monotonically with the parameter of the Zipf model, so that a large parameter value leads to tight regulation, that is, easy prediction, and vice versa. There are two extreme cases. At one extreme of the parameter range, there is maximal confusion between the classes, and the CoD is zero. At the other extreme, there is maximal discrimination between the classes, and the CoD is one. Thus, varying the parameter traverses the probability model space continuously from easy to difficult models.
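The behavior described above can be illustrated with a small sketch. The exact form of the Zipf model is not reproduced in the text above, so the code assumes one common parameterization: class-0 bin probabilities proportional to $i^{-\alpha}$, class-1 bin probabilities given by the same values in reverse order, and equal class priors. The parameter name `alpha`, the function names, and this parameterization are all assumptions of this example.

```python
def zipf_model(b, alpha):
    """Class-conditional bin probabilities under an assumed Zipf model."""
    weights = [i ** (-alpha) for i in range(1, b + 1)]
    total = sum(weights)
    p = [w / total for w in weights]  # class 0: mass concentrated on low bins
    q = p[::-1]                       # class 1: mirror image of class 0
    return p, q


def actual_cod(c0, p, q):
    """Actual CoD from the optimal errors with and without predictors."""
    c1 = 1.0 - c0
    eps0 = min(c0, c1)
    eps = sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))
    return (eps0 - eps) / eps0


for alpha in (0.0, 0.5, 1.0, 2.0, 4.0):
    p, q = zipf_model(4, alpha)
    print(alpha, round(actual_cod(0.5, p, q), 4))
```

Under this assumed parameterization, a parameter value of zero makes both classes uniform over the bins, and increasing the parameter separates the classes.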
We consider here the prediction setting where each predictor variable is binary. Employing 2, 3, and 4 predictor variables then corresponds to bin sizes $b = 4, 8, 16$, respectively. In functional genomics applications, these cases correspond to the gene prediction problem using 2, 3, and 4 genes, where the activity of each gene is represented by a binary gene expression value, for example, the on-and-off switch effect of a promoter.
Figure 1 makes apparent several facts. The resubstitution CoD is often optimistically biased, except at moderate to large CoD with $b = 4$ (two binary predictors), whereas the other estimators are generally pessimistically biased. As the number of predictors increases, the bias (in magnitude) of the resubstitution CoD increases accordingly; however, its variance remains quite low in each case. The leave-one-out CoD is highly variable, in addition to being pessimistically biased. By observing the RMS, we conclude that the resubstitution CoD estimator is the best-performing estimator, except at small values of the actual CoD, beating all the other estimators, including the bootstrap. The leave-one-out CoD estimator is the worst-performing estimator for cases with a small number of predictors, whereas the cross-validation CoD estimator becomes the worst-performing estimator for a large number of predictors and moderate actual CoD. As the number of predictors increases, the value of the actual CoD at which the leave-one-out CoD estimator starts to outperform the cross-validation CoD estimator decreases. It is also interesting to note that, for $b = 4$, only the bootstrap beats resubstitution, and only for very small actual CoD. For $b = 8$, both bootstrap and cross-validation perform better than resubstitution for small actual CoD. For $b = 16$, all the other CoD estimators outperform resubstitution for small actual CoD. As the number of predictors increases, the cut-off at which the resubstitution CoD estimator beats all other estimators increases.
7. Conclusion
This paper presented a comprehensive study of CoD estimators. We derived for the first time exact analytical expressions of performance metrics of the resubstitution and leave-one-out CoD estimators. Using a parametric Zipf model, we have compared the exact performance metrics of resubstitution and leave-one-out with each other and against approximate performance metrics of the cross-validation and bootstrap CoD estimators. Our results lead to a perhaps surprising conclusion: under the Zipf model considered, the resubstitution CoD estimator is the best-performing estimator among all, for moderate to large actual CoD and a not too large number of predictors. However, for small actual CoD values and high classifier complexity, the other three CoD estimators can outperform resubstitution. This indicates that, provided one has evidence of moderate to tight regulation between the genes and the number of predictors is not too large, one should use the CoD estimator based on resubstitution.
This work is intended to serve as a foundation for a detailed study of the application of CoD estimation in genomics and related fields. An obvious application is the inference of genomic regulatory networks from sample microarray data. In addition, there are several issues related to nonlinear prediction in the discrete domain that can benefit from the work presented here.
This work was supported by the National Science Foundation, through NSF award CCF-0845407.
- Dougherty ER, Kim S, Chen Y: Coefficient of determination in nonlinear signal processing. Signal Processing 2000, 80(10):2219-2235. doi:10.1016/S0165-1684(00)00079-7
- Martins DC Jr., Braga-Neto UM, Hashimoto RF, Bittner ML, Dougherty ER: Intrinsically multivariate predictive genes. IEEE Journal on Selected Topics in Signal Processing 2008, 2(3):424-439.
- Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: A general nonlinear framework for the analysis of gene interaction via multivariate expression arrays. Journal of Biomedical Optics 2000, 5(4):411-424. doi:10.1117/1.1289142
- Kim S, Dougherty ER, Chen Y, Sivakumar K, Meltzer P, Trent JM, Bittner M: Multivariate measurement of gene expression relationships. Genomics 2000, 67(2):201-209. doi:10.1006/geno.2000.6241
- Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18(2):261-274. doi:10.1093/bioinformatics/18.2.261
- Zhou X, Wang X, Dougherty ER: Binarization of microarray data based on a mixture model. Molecular Cancer Therapeutics 2003, 2(7):679-684.
- Devroye L, Gyorfi L, Lugosi G: A Probabilistic Theory of Pattern Recognition. Springer, New York, NY, USA; 1996.
- Braga-Neto UM: Classification and error estimation for discrete data. Current Genomics 2009, 10(7):446-462. doi:10.2174/138920209789208228
- Braga-Neto U, Dougherty E: Exact performance of error estimators for discrete classifiers. Pattern Recognition 2005, 38(11):1799-1814. doi:10.1016/j.patcog.2005.02.013
- Xu Q, Hua J, Braga-Neto U, Xiong Z, Suh E, Dougherty ER: Confidence intervals for the true classification error conditioned on the estimated error. Technology in Cancer Research and Treatment 2006, 5(6):579-589.
- Smith CAB: Some examples of discrimination. Annals of Eugenics 1947, 18: 272-282.
- Lachenbruch PA, Mickey MR: Estimation of error rates in discriminant analysis. Technometrics 1968, 10: 1-11. doi:10.2307/1266219
- Stone M: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B 1974, 36: 111-147.
- Efron B: Bootstrap methods: another look at the jackknife. Annals of Statistics 1979, 7: 1-26.
- Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 1983, 78(382):316-331. doi:10.2307/2288636
- Zipf GK: Psycho-Biology of Languages. Houghton-Mifflin, Boston, Mass, USA; 1935.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.