Exact Performance of CoD Estimators in Discrete Prediction
EURASIP Journal on Advances in Signal Processing volume 2010, Article number: 487893 (2010)
Abstract
The coefficient of determination (CoD) has significant applications in genomics, for example, in the inference of gene regulatory networks. We study several CoD estimators, based upon the resubstitution, leave-one-out, cross-validation, and bootstrap error estimators. We present an exact formulation of performance metrics for the resubstitution and leave-one-out CoD estimators, assuming the discrete histogram rule. Numerical experiments are carried out using a parametric Zipf model, where we compute exact performance metrics of the resubstitution and leave-one-out CoD estimators using the derived equations, for varying actual CoD, sample size, and bin size. These results are compared to approximate performance metrics of the 10-repeated 2-fold cross-validation and 0.632 bootstrap CoD estimators, computed via Monte Carlo sampling. The numerical results lead to a perhaps surprising conclusion: under the Zipf model considered here, and for moderate and large values of the actual CoD, the resubstitution CoD estimator is the least biased and least variable among all CoD estimators, especially with a small number of predictors. We also observed that the leave-one-out and cross-validation CoD estimators tend to perform the worst, whereas the performance of the bootstrap CoD estimator is intermediate, despite its high computational complexity.
1. Introduction
In classical regression analysis, the nonlinear coefficient of determination (CoD) gives the relative decrease in unexplained variability when entering a variable $X$ into the regression of the dependent variable $Y$, in comparison with the total unexplained variability when entering no variables. Applying this to pattern prediction, Dougherty and collaborators introduced a very similar concept, that of CoD for binary random variables, which measures the predictive power of a set of predictor variables $X_1, \ldots, X_d$ with respect to a target variable $Y$, as given by
$$\mathrm{CoD} = \frac{\varepsilon_0 - \varepsilon}{\varepsilon_0}, \tag{1}$$
where $\varepsilon_0$ is the error of the best predictor of $Y$ in the absence of other observations and $\varepsilon$ is the error of the best predictor of $Y$ based on the observation of $X_1, \ldots, X_d$. The binary CoD measures the relative decrease in prediction error when using predictor variables to estimate the target variable, as opposed to using no predictor variables. The closer it is to one, the tighter the regulation of the target variable by the predictor variables, whereas the closer it is to zero, the looser the regulation. The CoD will correctly produce low values in cases where the no-predictor error is already small, or when adding predictors does not contribute to a significant decrease in error. The CoD is a function only of the joint distribution between predictors and target; thus, it characterizes the regulatory relationship among them.
The concept of CoD has far-reaching applications in genomics. The CoD was perhaps the first predictive paradigm utilized in the context of microarray data, the goal being to provide a measure of nonlinear interaction among genes [1–6]. In [2, 4, 6], the CoD is applied to discrete prediction, where gene expression is quantized into discrete levels. In [3, 5], the CoD is applied to the reconstruction or inference of gene regulatory networks. Like its classical counterpart, the binary CoD is a goodness-of-fit statistic that can be used to assess the relationship between predictor and target variables, for example, the associations between gene expression patterns in practical applications. The CoD permits biologists to focus on particular connections in the genome, and the estimated coefficients provide a practical criterion for selecting among potential predictor sets.
The error of the best predictor corresponds to the optimal prediction error, also known as the Bayes error, which depends only on the underlying probability model. However, in practical real-world problems, the underlying probability model is unknown, and thus we arrive at the fundamental issue of how to find a good prediction error estimator in small-sample settings [8, 9]. An error estimator may be a deterministic function of the sample data, in which case it is called a nonrandomized error estimator; popular error estimators such as resubstitution and leave-one-out are examples. These error estimators are random only through the random sample data. Closed-form analytical expressions for performance metrics such as bias, deviation variance, and RMS of the resubstitution and leave-one-out error estimators have been given in [9, 10]. By contrast, randomized error estimators, like cross-validation and bootstrap, have "internal" random factors that affect their outcome, and thus approximate approaches, usually via Monte Carlo sampling, are typically used to analyze their performance.
Likewise, the CoD must in practice be estimated from sample data. A CoD estimator is obtained from (1) by using one of the usual error estimators for the prediction error with variables, $\varepsilon$, and the empirical frequency estimator for the prediction error with no variables, $\varepsilon_0$; we may thus speak of non-randomized CoD estimators, including the resubstitution and leave-one-out CoD estimators, and randomized CoD estimators, including the bootstrap and cross-validation CoD estimators. The CoD with the true values of $\varepsilon$ and $\varepsilon_0$ in (1) will be called in this paper the "actual CoD." We will employ the discrete histogram rule [7, 8], the most widely used and intuitive rule for discrete prediction problems, in order to estimate prediction errors and CoDs from the sample data.
This paper presents, for the first time, an exact formulation for performance metrics of the resubstitution and leave-one-out CoD estimators, for the discrete histogram rule. Numerical experiments are carried out using a parametric Zipf model, where we compute the exact performance of the resubstitution and leave-one-out CoD estimators using the derived equations, for varying actual CoD, sample size, and bin size. We compare these results to approximate performance metrics of randomized CoD estimators (bootstrap and cross-validation), computed via Monte Carlo sampling. The numerical results indicate that, under the Zipf model considered here, and for moderate and large values of the actual CoD, the resubstitution CoD estimator is the least biased and least variable among all CoD estimators, especially with a small number of predictors. In fact, with two predictors, the resubstitution CoD estimator nearly uniformly dominates all other estimators across all values of the actual CoD. The leave-one-out and cross-validation CoD estimators tend to perform the worst, whereas the performance of the bootstrap CoD estimator is intermediate, despite its high computational complexity. This indicates that, provided one has evidence of moderate to tight regulation between the genes and the number of predictors is not too large, one should use the CoD estimator based on resubstitution.
This paper is organized as follows. In Section 2, the probability model used in discrete prediction is introduced. In Section 3, the discrete histogram rule is recalled, and formal definitions are given for the actual CoD and several CoD estimators, including two non-randomized CoD estimators (i.e., resubstitution and leave-one-out) and two randomized CoD estimators (i.e., 0.632 bootstrap and 10-repeated 2-fold cross-validation). Section 4 introduces performance metrics (i.e., bias, deviation variance, RMS) of a CoD estimator, and Section 5 presents an analytical formulation of exact performance metrics of the resubstitution and leave-one-out CoD estimators. In Section 6, we present numerical results, based on a parametric Zipf model, that compare the performance metrics of all the CoD estimators considered in this paper. Finally, Section 7 presents concluding remarks.
2. Discrete Prediction
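Let $X_1, X_2, \ldots, X_d$ be predictor random variables, each taking on a finite number of values, and let $Y \in \{0, 1\}$ be the target random variable, for the discrete prediction problem. The predictors as a group can take on values in a finite space with $b$ possible states. For analysis purposes, we establish a bijection between this finite state space and a single predictor variable $X$ taking values in the set $\{1, \ldots, b\}$. The variable $X$ has a one-to-one relationship with the finite state space coded by $X_1, \ldots, X_d$: one specific value of $X$ represents a specific combination of the values of the original predictors, that is, a "bin" into which the data is categorized. The value $b$ is the number of bins, which provides a direct measure of predictor complexity.
The probability model for the pair $(X, Y)$ is specified by class prior probabilities $c_0 = P(Y = 0)$ and $c_1 = P(Y = 1)$, and class-conditional probabilities $p_i = P(X = i \mid Y = 0)$ and $q_i = P(X = i \mid Y = 1)$, for $i = 1, \ldots, b$, where we have the identities
$$c_0 + c_1 = 1, \qquad \sum_{i=1}^{b} p_i = 1, \qquad \sum_{i=1}^{b} q_i = 1. \tag{2}$$
Given a specific probability model, the optimal predictor for the problem is given by
$$\psi^*(i) = \begin{cases} 1, & c_1 q_i > c_0 p_i, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, b, \tag{3}$$
with optimal error rate, also called the Bayes error, determined by
$$\varepsilon = \sum_{i=1}^{b} \min\{c_0 p_i,\, c_1 q_i\}. \tag{4}$$
If no features are provided, the optimal error rate becomes
$$\varepsilon_0 = \min\{c_0,\, c_1\}. \tag{5}$$
By using the simple inequality $\min\{s, u\} + \min\{t, v\} \le \min\{s + t,\, u + v\}$, one concludes that $\varepsilon \le \varepsilon_0$ in all cases.
The coefficient of determination is defined as (assuming that $\varepsilon_0 > 0$)
$$\mathrm{CoD} = \frac{\varepsilon_0 - \varepsilon}{\varepsilon_0} = 1 - \frac{\varepsilon}{\varepsilon_0}. \tag{6}$$
Since $0 \le \varepsilon \le \varepsilon_0$, we have that $0 \le \mathrm{CoD} \le 1$. We have $\mathrm{CoD} = 1$ if and only if $\varepsilon = 0$, that is, there is perfect regulation between predictors and target. On the other hand, $\mathrm{CoD} = 0$ if and only if $\varepsilon = \varepsilon_0$, that is, the predictors exert no regulation on the target.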
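As a small illustration of the definitions in this section, the following sketch computes the Bayes error, the no-predictor error, and the actual CoD from a fully specified discrete model. The function name `actual_cod` and the example probabilities are hypothetical choices made for illustration, and bins are indexed from 0 in the code rather than from 1.

```python
def actual_cod(c0, p, q):
    """Return (bayes_error, no_predictor_error, CoD) for a discrete model.

    c0 : prior probability of class 0 (class 1 has prior 1 - c0)
    p  : list of class-conditional bin probabilities given class 0
    q  : list of class-conditional bin probabilities given class 1
    """
    c1 = 1.0 - c0
    # Bayes error: in each bin the optimal predictor errs on the less likely class.
    eps = sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))
    # Optimal error with no predictors: always guess the more likely class.
    eps0 = min(c0, c1)
    if eps0 == 0:
        raise ValueError("CoD is undefined when the no-predictor error is zero")
    return eps, eps0, 1.0 - eps / eps0


# Illustrative model with four bins and equal priors (made-up numbers).
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
eps, eps0, cod = actual_cod(0.5, p, q)
print(eps, eps0, cod)
```

For this example the Bayes error is 0.3 and the no-predictor error is 0.5, so the actual CoD is 0.4 up to floating-point rounding.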
3. CoD Estimation
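In practice, the underlying probability model is unknown, and thus the CoD is not known. The need thus arises to find estimators of the CoD from i.i.d. sample data $S_n$, consisting of $n$ sample points drawn from the unknown probability model distribution. All CoD estimators considered here will be of the form
$$\widehat{\mathrm{CoD}} = \frac{\hat{\varepsilon}_0 - \hat{\varepsilon}}{\hat{\varepsilon}_0} = 1 - \frac{\hat{\varepsilon}}{\hat{\varepsilon}_0}, \tag{7}$$
where $\hat{\varepsilon}$ is one of the usual error estimators for a selected discrete prediction rule, and $\hat{\varepsilon}_0$ is the empirical frequency estimator for the prediction error with no variables,
$$\hat{\varepsilon}_0 = \min\left\{\frac{N_0}{n},\, \frac{N_1}{n}\right\}, \tag{8}$$
where $N_0$ and $N_1$ are random variables corresponding to the number of sample points belonging to classes $Y = 0$ and $Y = 1$, respectively. We assume throughout that $N_0, N_1 > 0$, that is, each class is represented by at least one sample. Note that $\hat{\varepsilon}_0$ has the desirable property of being a universally consistent estimator of $\varepsilon_0$ in (5), that is, $\hat{\varepsilon}_0 \to \varepsilon_0$ in probability (in fact, almost surely) as $n \to \infty$, regardless of the probability model.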
The discrete prediction rule to be used with the error estimator $\hat{\varepsilon}$ is the discrete histogram rule, which is the "plug-in" rule for approximating the minimum-error Bayes predictor. Even though we make this choice, we remark that the methods described here can be applied to any discrete prediction rule. Given the sample data $S_n$, the discrete histogram classifier is given by
$$\psi_n(i) = \begin{cases} 1, & V_i > U_i, \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$
where $U_i$ is the number of samples with $Y = 0$ in bin $i$, and $V_i$ is the number of samples with $Y = 1$ in bin $i$, for $i = 1, \ldots, b$.
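We review next some facts about the distribution of the random vectors $\mathbf{U} = (U_1, \ldots, U_b)$ and $\mathbf{V} = (V_1, \ldots, V_b)$, which will be needed in the sequel. Here $U_i$ and $V_i$ are the numbers of class-0 and class-1 sample points in bin $i$, $c_0$ and $c_1$ are the class priors, and $p_i$ and $q_i$ are the class-conditional bin probabilities. The variables $N_0$, $N_1$, $U_i$, and $V_i$, for $i = 1, \ldots, b$, are random variables due to the randomness of the sample data (this is the case referred to as "full sampling"). More specifically, $N_0$ is a random variable binomially distributed with parameters $(n, c_0)$, that is, $P(N_0 = l) = \binom{n}{l} c_0^{\,l}\, c_1^{\,n-l}$, for $l = 0, \ldots, n$, while the vector-valued random variable $(U_i, V_i)$ is trinomially distributed with the parameter set $(n, c_0 p_i, c_1 q_i)$, that is,
$$P(U_i = k,\, V_i = l) = \frac{n!}{k!\, l!\, (n-k-l)!}\, (c_0 p_i)^k\, (c_1 q_i)^l\, (1 - c_0 p_i - c_1 q_i)^{n-k-l}, \tag{10}$$
for $k, l \ge 0$ with $k + l \le n$. In addition, the vector $(U_1, \ldots, U_b, V_1, \ldots, V_b)$ follows a multinomial distribution with parameters $(n, c_0 p_1, \ldots, c_0 p_b, c_1 q_1, \ldots, c_1 q_b)$, so that
$$P(U_1 = u_1, \ldots, U_b = u_b,\, V_1 = v_1, \ldots, V_b = v_b) = \frac{n!}{\prod_{i=1}^{b} u_i!\, v_i!}\, \prod_{i=1}^{b} (c_0 p_i)^{u_i} (c_1 q_i)^{v_i}, \tag{11}$$
for nonnegative integers $u_i, v_i$ with $\sum_{i=1}^{b} (u_i + v_i) = n$.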
We introduce next each of the CoD estimators considered in this paper.
3.1. Resubstitution CoD Estimator
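This corresponds to the choice of resubstitution as the prediction error estimator,
$$\widehat{\mathrm{CoD}}_r = \frac{\hat{\varepsilon}_0 - \hat{\varepsilon}_r}{\hat{\varepsilon}_0}, \tag{12}$$
where, for the discrete histogram predictor,
$$\hat{\varepsilon}_r = \frac{1}{n} \sum_{i=1}^{b} \min\{U_i,\, V_i\}, \tag{13}$$
with $U_i$ and $V_i$ the numbers of class-0 and class-1 sample points in bin $i$.
The resubstitution CoD can be written equivalently as
$$\widehat{\mathrm{CoD}}_r = \frac{\min\left\{\frac{N_0}{n}, \frac{N_1}{n}\right\} - \sum_{i=1}^{b} \min\left\{\frac{U_i}{n}, \frac{V_i}{n}\right\}}{\min\left\{\frac{N_0}{n}, \frac{N_1}{n}\right\}}, \tag{14}$$
which reveals that $\widehat{\mathrm{CoD}}_r$ has the desirable property of being a universally consistent estimator of $\mathrm{CoD}$ in (6), that is, $\widehat{\mathrm{CoD}}_r \to \mathrm{CoD}$ in probability (in fact, almost surely) as $n \to \infty$, regardless of the probability model.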
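The resubstitution CoD estimator depends on the sample only through the per-bin class counts, which the following sketch makes concrete. The helper names `bin_counts` and `resub_cod` are hypothetical, bins are indexed from 0, and the sketch assumes the histogram rule counts one error per minority-class point in each bin.

```python
def bin_counts(xs, ys, b):
    """Count class-0 (U) and class-1 (V) sample points in each of b bins."""
    U = [0] * b
    V = [0] * b
    for x, y in zip(xs, ys):
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    return U, V


def resub_cod(U, V):
    """Resubstitution CoD estimate for the discrete histogram rule."""
    n0, n1 = sum(U), sum(V)
    if min(n0, n1) == 0:
        raise ValueError("each class must be represented by at least one sample")
    # Resubstitution errors: in each bin the minority class is misclassified.
    errors = sum(min(u, v) for u, v in zip(U, V))
    # The common factor 1/n cancels in the ratio of the two error estimates.
    return 1.0 - errors / min(n0, n1)


# Illustrative counts for b = 4 bins and n = 10 sample points.
U = [3, 1, 0, 1]
V = [0, 1, 2, 2]
print(resub_cod(U, V))
```

With these counts there are 2 resubstitution errors and 5 points in the smaller class, so the estimate is 1 - 2/5 = 0.6.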
3.2. Leave-One-Out CoD Estimator
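This corresponds to the choice of the leave-one-out error estimator as the prediction error estimator,
$$\widehat{\mathrm{CoD}}_l = \frac{\hat{\varepsilon}_0 - \hat{\varepsilon}_l}{\hat{\varepsilon}_0}, \tag{15}$$
where, for the discrete histogram predictor (as can be readily checked),
$$\hat{\varepsilon}_l = \frac{1}{n} \sum_{i=1}^{b} \left[ U_i\, I_{U_i \le V_i} + V_i\, I_{V_i \le U_i + 1} \right], \tag{16}$$
with $U_i$ and $V_i$ the numbers of class-0 and class-1 sample points in bin $i$, $I_A$ the indicator of the event $A$, and ties within a bin assigned to class 0.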
The leave-one-out CoD estimator provides an opportunity to reflect on the uniform choice of the empirical frequency estimator $\hat{\varepsilon}_0$ in (8) as an estimator of $\varepsilon_0$, including here. Clearly, the empirical frequency corresponds to the resubstitution estimator of $\varepsilon_0$. The question arises as to whether, for the leave-one-out CoD estimator, the leave-one-out error estimator of $\varepsilon_0$ should be used instead. For $N_0 = N_1$, we get $\hat{\varepsilon}_0 = 1/2$ with the choice of the resubstitution estimator (empirical frequency), but $\hat{\varepsilon}_0 = 1$ with the choice of the leave-one-out estimator, since removing any sample point leaves its own class in the minority; this is a useless result. Similar problems beset other estimators of $\varepsilon_0$. Hence, the empirical frequency estimator is employed here as the estimator of $\varepsilon_0$ for all CoD estimators.
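The leave-one-out CoD estimator can also be computed from the bin counts alone. The sketch below assumes that ties within a bin are assigned to class 0, which is a convention chosen here for illustration; the function names are hypothetical. A brute-force version that actually removes each point in turn is included as a cross-check on the count-based formula.

```python
def loo_cod(U, V):
    """Leave-one-out CoD estimate from bin counts (ties go to class 0)."""
    n0, n1 = sum(U), sum(V)
    if min(n0, n1) == 0:
        raise ValueError("each class must be represented by at least one sample")
    # A class-0 point in bin i is misclassified when left out iff U_i <= V_i;
    # a class-1 point in bin i is misclassified when left out iff V_i <= U_i + 1.
    errors = sum(u * (u <= v) + v * (v <= u + 1) for u, v in zip(U, V))
    # The empirical frequency estimator is kept in the denominator.
    return 1.0 - errors / min(n0, n1)


def loo_cod_brute_force(U, V):
    """Same estimate, obtained by removing each sample point in turn."""
    b = len(U)
    data = [(i, 0) for i in range(b) for _ in range(U[i])]
    data += [(i, 1) for i in range(b) for _ in range(V[i])]
    errors = 0
    for x, y in data:
        u, v = U[x], V[x]
        if y == 0:
            u -= 1
        else:
            v -= 1
        predicted = 1 if v > u else 0  # histogram rule on the reduced sample
        errors += predicted != y
    return 1.0 - errors / min(sum(U), sum(V))


U = [3, 1, 0, 1]
V = [0, 1, 2, 2]
print(loo_cod(U, V), loo_cod_brute_force(U, V))
```

For these counts, 5 of the 10 points are misclassified when left out, so the leave-one-out CoD estimate is 0.0, compared with a resubstitution estimate of 0.6 on the same counts.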
3.3. Cross-Validation CoD Estimator
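This corresponds to the choice of the cross-validation error estimator [12, 13] as the prediction error estimator. In $k$-fold cross-validation, the sample data $S_n$ is partitioned into $k$ folds $S_{(i)}$, for $i = 1, \ldots, k$. For simplicity, we assume that $k$ divides $n$. A classifier is designed on the training set $S_n \setminus S_{(i)}$ and tested on $S_{(i)}$, for $i = 1, \ldots, k$. Since there are different partitions of the data into $k$ folds, one can repeat the $k$-fold cross-validation $r$ times and then average the results. Such a process leads to the $r$-repeated $k$-fold cross-validation error estimator $\hat{\varepsilon}_{\mathrm{cv}(r,k)}$, given by
$$\hat{\varepsilon}_{\mathrm{cv}(r,k)} = \frac{1}{rn} \sum_{s=1}^{r} \sum_{i=1}^{k} \sum_{j=1}^{n/k} \left| Y_j^{(i,s)} - \psi_{n,i,s}\!\left(X_j^{(i,s)}\right) \right|, \tag{17}$$
where $(X_j^{(i,s)}, Y_j^{(i,s)})$ represents the $j$th sample point in the $i$th fold for the $s$th repetition of the cross-validation, and $\psi_{n,i,s}$ is the classifier designed with that fold left out, for $i = 1, \ldots, k$, $j = 1, \ldots, n/k$, and $s = 1, \ldots, r$.
Based upon (17), the $r$-repeated $k$-fold cross-validation CoD estimator is defined by
$$\widehat{\mathrm{CoD}}_{\mathrm{cv}(r,k)} = \frac{\hat{\varepsilon}_0 - \hat{\varepsilon}_{\mathrm{cv}(r,k)}}{\hat{\varepsilon}_0}. \tag{18}$$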
In order to get reasonable variance properties, a large number of repetitions may be required, which can make the cross-validation CoD estimator slow to compute.
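A minimal sketch of a repeated $k$-fold cross-validation CoD estimator for the discrete histogram rule follows. The function names, the way folds are drawn (a fresh random shuffle per repetition), and the tie-breaking toward class 0 are assumptions made for illustration; only the Python standard library is used, and bins are indexed from 0.

```python
import random


def histogram_rule(train, b):
    """Design the discrete histogram classifier; ties and empty bins go to class 0."""
    U = [0] * b
    V = [0] * b
    for x, y in train:
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    return [1 if V[i] > U[i] else 0 for i in range(b)]


def cv_cod(data, b, k=2, r=10, rng=None):
    """r-repeated k-fold cross-validation CoD estimate for (bin, label) pairs."""
    rng = rng or random.Random(0)
    n = len(data)
    if n % k != 0:
        raise ValueError("k must divide the sample size")
    n1 = sum(y for _, y in data)
    eps0_hat = min(n1, n - n1) / n  # empirical frequency estimator
    if eps0_hat == 0:
        raise ValueError("each class must be represented by at least one sample")
    fold_size = n // k
    errors = 0
    for _ in range(r):
        idx = list(range(n))
        rng.shuffle(idx)  # a new random partition for every repetition
        for f in range(k):
            test_idx = idx[f * fold_size:(f + 1) * fold_size]
            held_out = set(test_idx)
            train = [data[j] for j in idx if j not in held_out]
            psi = histogram_rule(train, b)
            errors += sum(1 for j in test_idx if psi[data[j][0]] != data[j][1])
    eps_cv = errors / (r * n)
    return 1.0 - eps_cv / eps0_hat


# Ten illustrative sample points in b = 4 bins.
data = [(0, 0)] * 3 + [(1, 0), (3, 0), (1, 1)] + [(2, 1)] * 2 + [(3, 1)] * 2
print(cv_cod(data, b=4, k=2, r=10))
```

Setting `k` equal to the sample size with `r=1` reduces this to leave-one-out, which gives a deterministic check: for the data above the result is 0.0 regardless of the shuffle.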
3.4. Bootstrap CoD Estimator
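This corresponds to the use of the bootstrap [14, 15] for the prediction error estimator. A bootstrap sample $S_n^*$ consists of $n$ equally-likely draws with replacement from the original data $S_n$. Some sample points from the original data may appear multiple times in the bootstrap sample, whereas other sample points may not appear at all. The actual proportion of times a sample point $(X_j, Y_j)$ appears in $S_n^*$ can be written as $P_j^* = \frac{1}{n} \sum_{i=1}^{n} I_{(X_i^*, Y_i^*) = (X_j, Y_j)}$, for $j = 1, \ldots, n$. A predictor $\psi_n^{*,m}$ may be designed on a bootstrap sample $S_n^{*,m}$ and tested on $S_n \setminus S_n^{*,m}$, for $m = 1, \ldots, B$, where $B$ is a sufficiently large number of repetitions (in this paper, ). Then, the basic bootstrap zero estimator is given by
$$\hat{\varepsilon}_{\mathrm{boot}} = \frac{\sum_{m=1}^{B} \sum_{j=1}^{n} \left| Y_j - \psi_n^{*,m}(X_j) \right| I_{P_j^{*,m} = 0}}{\sum_{m=1}^{B} \sum_{j=1}^{n} I_{P_j^{*,m} = 0}}. \tag{19}$$
The 0.632 bootstrap estimator then performs a weighted average of the bootstrap zero and resubstitution estimators,
$$\hat{\varepsilon}_{b632} = (1 - 0.632)\, \hat{\varepsilon}_r + 0.632\, \hat{\varepsilon}_{\mathrm{boot}}. \tag{20}$$
Based on (19) and (20), the bootstrap CoD estimator is then defined as
$$\widehat{\mathrm{CoD}}_{b632} = \frac{\hat{\varepsilon}_0 - \hat{\varepsilon}_{b632}}{\hat{\varepsilon}_0}. \tag{21}$$
The bootstrap CoD estimator can be very slow to compute due to the complexity of the bootstrap zero estimator.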
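The following sketch shows one way the 0.632 bootstrap CoD estimator could be computed for the discrete histogram rule. The default `B=100` is an arbitrary placeholder and not the value used in the paper; the function names, the tie-breaking toward class 0, and tracking left-out points by index are likewise assumptions made for illustration.

```python
import random


def histogram_rule(train, b):
    """Design the discrete histogram classifier; ties and empty bins go to class 0."""
    U = [0] * b
    V = [0] * b
    for x, y in train:
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    return [1 if V[i] > U[i] else 0 for i in range(b)]


def bootstrap632_cod(data, b, B=100, seed=0):
    """0.632 bootstrap CoD estimate for a list of (bin, label) pairs."""
    rng = random.Random(seed)
    n = len(data)
    n1 = sum(y for _, y in data)
    eps0_hat = min(n1, n - n1) / n  # empirical frequency estimator
    if eps0_hat == 0:
        raise ValueError("each class must be represented by at least one sample")
    # Resubstitution error of the classifier designed on the full sample.
    psi = histogram_rule(data, b)
    eps_resub = sum(1 for x, y in data if psi[x] != y) / n
    errors = tested = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        chosen = set(idx)
        psi_star = histogram_rule([data[j] for j in idx], b)
        # Test only on the points that do not appear in the bootstrap sample.
        for j in range(n):
            if j not in chosen:
                tested += 1
                errors += psi_star[data[j][0]] != data[j][1]
    if tested == 0:
        raise ValueError("no left-out points; increase B")
    eps_zero = errors / tested
    eps_632 = 0.368 * eps_resub + 0.632 * eps_zero
    return 1.0 - eps_632 / eps0_hat


data = [(0, 0)] * 3 + [(1, 0), (3, 0), (1, 1)] + [(2, 1)] * 2 + [(3, 1)] * 2
print(bootstrap632_cod(data, b=4))
```

Because the bootstrap zero error lies between 0 and 1, the estimate for this data set is bounded above by 1 - 0.368(0.2)/0.5 = 0.8528 and can be negative when the bootstrap zero error is large.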
4. Performance Metrics of CoD Estimators
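In analogous fashion to the performance metrics of prediction error estimators, the key performance metrics for a CoD estimator $\widehat{\mathrm{CoD}}$ are its bias,
$$\mathrm{Bias}(\widehat{\mathrm{CoD}}) = E[\widehat{\mathrm{CoD}}] - \mathrm{CoD}, \tag{22}$$
the deviation variance (which in the present case is equal simply to its variance),
$$\mathrm{Var}_{\mathrm{dev}}(\widehat{\mathrm{CoD}}) = \mathrm{Var}(\widehat{\mathrm{CoD}} - \mathrm{CoD}) = \mathrm{Var}(\widehat{\mathrm{CoD}}), \tag{23}$$
and the root mean-square (RMS) error,
$$\mathrm{RMS}(\widehat{\mathrm{CoD}}) = \sqrt{E\left[(\widehat{\mathrm{CoD}} - \mathrm{CoD})^2\right]} = \sqrt{\mathrm{Bias}(\widehat{\mathrm{CoD}})^2 + \mathrm{Var}_{\mathrm{dev}}(\widehat{\mathrm{CoD}})}. \tag{24}$$
For a given probability model, all the performance metrics are thus obtained as a function of the expectation $E[\widehat{\mathrm{CoD}}]$ and variance $\mathrm{Var}(\widehat{\mathrm{CoD}})$.
Working further, we obtain
$$E[\widehat{\mathrm{CoD}}] = 1 - E\left[\frac{\hat{\varepsilon}}{\hat{\varepsilon}_0}\right], \tag{25}$$
$$\mathrm{Var}(\widehat{\mathrm{CoD}}) = E\left[\left(\frac{\hat{\varepsilon}}{\hat{\varepsilon}_0}\right)^2\right] - \left(E\left[\frac{\hat{\varepsilon}}{\hat{\varepsilon}_0}\right]\right)^2, \tag{26}$$
as can be easily checked. We conclude that all the key performance metrics for CoD estimators can be obtained from the first and second moments of $\hat{\varepsilon}/\hat{\varepsilon}_0$.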
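When exact moments are not available, bias, variance, and RMS can be approximated by Monte Carlo sampling. The sketch below is one possible setup rather than the procedure used in the paper: it draws bin counts from a specified model, redraws any sample in which a class is missing (an assumption about how the requirement that both classes be represented is enforced), and uses the resubstitution CoD estimator only because it is compact. All names and the example model are hypothetical.

```python
import math
import random


def draw_counts(n, c0, p, q, rng):
    """Draw per-bin class counts (U, V) for n i.i.d. points; redraw if a class is empty."""
    b = len(p)
    while True:
        U = [0] * b
        V = [0] * b
        for _ in range(n):
            if rng.random() < c0:
                U[rng.choices(range(b), weights=p)[0]] += 1
            else:
                V[rng.choices(range(b), weights=q)[0]] += 1
        if sum(U) > 0 and sum(V) > 0:
            return U, V


def resub_cod(U, V):
    """Resubstitution CoD estimate for the discrete histogram rule."""
    errors = sum(min(u, v) for u, v in zip(U, V))
    return 1.0 - errors / min(sum(U), sum(V))


def monte_carlo_metrics(estimator, actual_cod, n, c0, p, q, reps=2000, seed=0):
    """Approximate bias, variance, and RMS of a CoD estimator by simulation."""
    rng = random.Random(seed)
    values = [estimator(*draw_counts(n, c0, p, q, rng)) for _ in range(reps)]
    mean = sum(values) / reps
    bias = mean - actual_cod
    variance = sum((v - mean) ** 2 for v in values) / reps
    rms = math.sqrt(sum((v - actual_cod) ** 2 for v in values) / reps)
    return bias, variance, rms


# Illustrative model with actual CoD 0.4 (made-up probabilities).
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
bias, variance, rms = monte_carlo_metrics(resub_cod, 0.4, n=20, c0=0.5, p=p, q=q, reps=500)
print(bias, variance, rms)
```

Because the variance is computed with divisor `reps`, the sample quantities satisfy RMS² = Bias² + Var up to rounding, mirroring the identity between the population metrics.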
5. Exact Moments of Nonrandomized CoD Estimators
As mentioned in the Introduction, we can categorize CoD estimators into non-randomized and randomized, depending on whether the prediction error estimator is non-randomized or randomized. Non-randomized CoD estimators, such as the resubstitution and leave-one-out CoD estimators, are deterministic functions of the sample data, which makes an analytical formulation of their performance metrics possible. On the other hand, the performance of randomized CoD estimators, such as the cross-validation and bootstrap CoD estimators, is very difficult to study analytically and is typically investigated via Monte Carlo sampling (which is done in Section 6).
In this section, we will present exact expressions for the computation of the first moment $E[\hat{\varepsilon}/\hat{\varepsilon}_0]$ and the second moment $E[(\hat{\varepsilon}/\hat{\varepsilon}_0)^2]$ for the case of the resubstitution and leave-one-out error estimators, which suffices to compute the bias, variance, and RMS of the corresponding CoD estimator, as discussed in the previous section. These expressions are functions only of sample size, number of bins (complexity), and the probability model. We will assume throughout, for definiteness, that the sample size $n$ is even. The case where $n$ is odd is in fact slightly simpler and can be readily obtained in analogous fashion to the derivations presented below.
The first moment of is given by
where . Since , we have . It follows that the event is equal to the union of the disjoint events and , for whereas . By using Proposition 1 in the appendix, we can write both cases in a single expression as follows:
By using (28) in (27) and considering that , we obtain
The second moment of is given by
where , as before. By using Proposition 1 in the appendix, and the same reasoning applied previously in the case of the first moment, we can write
Combining (33) and (32) leads to
with as in (31) and
To obtain the first moment of , one can proceed exactly as in the resubstitution case to get
with as in (31), for .
To obtain the second moment of , one can again proceed as in the resubstitution case to get
with as in (31) and as in (36), for .
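For very small sample sizes and bin counts, the moments of the ratio of the resubstitution error estimate to the empirical frequency estimate can also be obtained by brute-force enumeration of the multinomial distribution of the bin counts. The sketch below is such an enumeration and is not the closed-form expressions of this section; it treats the requirement that both classes be represented as a conditioning event, which is an assumption, and all function names are hypothetical.

```python
from math import factorial


def compositions(n, parts):
    """Yield all tuples of `parts` nonnegative integers that sum to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest


def exact_resub_moments(n, c0, p, q):
    """First and second moments of (resubstitution error / empirical frequency).

    Enumerates every outcome of the 2b-cell multinomial for the bin counts and
    conditions on both classes being present in the sample.
    """
    b = len(p)
    cell_probs = [c0 * pi for pi in p] + [(1.0 - c0) * qi for qi in q]
    m1 = m2 = mass = 0.0
    for counts in compositions(n, 2 * b):
        U, V = counts[:b], counts[b:]
        n0, n1 = sum(U), sum(V)
        if min(n0, n1) == 0:
            continue  # excluded: a class is not represented
        prob = float(factorial(n))
        for k, pr in zip(counts, cell_probs):
            prob = prob / factorial(k) * pr ** k
        ratio = sum(min(u, v) for u, v in zip(U, V)) / min(n0, n1)
        m1 += prob * ratio
        m2 += prob * ratio ** 2
        mass += prob
    return m1 / mass, m2 / mass


# Illustrative model: two bins, equal priors, six sample points.
m1, m2 = exact_resub_moments(6, 0.5, [0.7, 0.3], [0.3, 0.7])
print("E[CoD_r] =", 1.0 - m1, " Var[CoD_r] =", m2 - m1 ** 2)
```

The enumeration visits every composition of $n$ into $2b$ parts, so it is practical only for small $n$ and $b$; its use here is as an independent check on analytical or Monte Carlo results.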
6. Numerical Experiments
Assuming a parametric probability model in this section, we plot the exact performance metrics of the resubstitution and leave-one-out CoD estimators, by using the analytical expressions obtained in Sections 4 and 5, under varying actual CoD, sample size, and predictor complexity (number of bins). We also compare these exact performance metrics with the approximate performance metrics for cross-validation and bootstrap CoD estimators computed via Monte Carlo sampling. The Monte Carlo computation was carried out by drawing simulated training data sets of the required sample size from the probability model in each case, and employing sample means and sample variances to approximate the performance metrics in Section 4.
The probability model used here is a parametric Zipf model . The class-conditional probabilities under the parametric Zipf model are given by
for , and . The normalizing constant is given by
For simplicity, we assume that . It can be seen easily from (6) that the CoD increases monotonically with , so that large leads to tight regulation, that is, easy prediction, and vice versa. There are two extreme cases. When , there is maximal confusion between the classes, and . When , there is maximal discrimination between the classes, and . Thus, varying the parameter can traverse the probability model space continuously from easy to difficult models.
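To make the role of the model parameter concrete, the sketch below uses one common form of a parametric Zipf model: class-0 bin probabilities proportional to $i^{-\alpha}$ and class-1 probabilities given by the same values in reverse order, with equal priors. This specific form, the parameter name `alpha`, and the function names are assumptions made for illustration; they are chosen so that the two extreme cases described above (CoD of 0 at one end, CoD approaching 1 at the other) are reproduced.

```python
def zipf_model(b, alpha):
    """Assumed Zipf-type model: p_i proportional to i**(-alpha), q the reverse of p."""
    weights = [float(i) ** (-alpha) for i in range(1, b + 1)]
    K = 1.0 / sum(weights)  # normalizing constant
    p = [K * w for w in weights]
    q = p[::-1]
    return p, q


def cod_of_model(c0, p, q):
    """Actual CoD of a discrete model with class-0 prior c0."""
    c1 = 1.0 - c0
    eps = sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))
    return 1.0 - eps / min(c0, c1)


# Sweep the parameter for b = 4 bins (two binary predictors) and equal priors.
for alpha in [0.0, 0.5, 1.0, 2.0, 4.0]:
    p, q = zipf_model(4, alpha)
    print(alpha, round(cod_of_model(0.5, p, q), 4))
```

Under this assumed form, the printed CoD values increase with `alpha`, so a target actual CoD can be reached by tuning the parameter, for example with a simple bisection search.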
We consider here the prediction setting where each predictor variable is binary. Employing 2, 3, and 4 predictor variables then corresponds to $b = 4$, $8$, and $16$ bins, respectively. In functional genomics applications, these cases correspond to the gene prediction problem using 2, 3, and 4 genes, where the activity of each gene is represented by binary gene expression, for example, the on-and-off switch effect of a promoter.
Figure 1 displays bias, variance, and RMS of the CoD estimators considered here, as a function of varying actual CoD (obtained by suitably tuning the model parameter). We recall that, in the figure, tight regulation, that is, easy prediction, is located on the right of these plots, whereas loose regulation, that is, difficult prediction, is located on the left.
Figure 1 makes apparent several facts. The resubstitution CoD is often optimistically biased, except at moderate to large CoD with $b = 4$ bins (two binary predictors), whereas the other estimators are generally pessimistically biased. As the number of predictors increases, the bias (in magnitude) of the resubstitution CoD increases accordingly; however, its variance remains quite low in each case. The leave-one-out CoD is highly variable, in addition to being pessimistically biased. By observing the RMS, we conclude that the resubstitution CoD estimator is the best-performing estimator, except at small values of the actual CoD, beating all the other estimators, including the bootstrap. The leave-one-out CoD estimator is the worst-performing estimator for cases with a small number of predictors, whereas the cross-validation CoD estimator becomes the worst-performing estimator for a large number of predictors and moderate actual CoD. As the number of predictors increases, the actual CoD cut-off at which the leave-one-out CoD estimator starts to outperform the cross-validation CoD estimator decreases accordingly. It is also interesting to note that, for $b = 4$, only the bootstrap beats resubstitution, and only for very small actual CoD. For $b = 8$, both bootstrap and cross-validation perform better than resubstitution for small actual CoD. For $b = 16$, all the other CoD estimators outperform resubstitution for small actual CoD. As the number of predictors increases, the cut-off at which the resubstitution CoD estimator beats all other estimators increases.
In order to assess the performance of the resubstitution CoD estimator and the remaining CoD estimators with respect to the classifier complexity (number of predictors), we display the performance metrics as a function of varying number of bins in Figures 2 and 3, for sample size , and , and moderate CoD and large CoD . The bias column shows that, for CoD , the resubstitution CoD is actually slightly pessimistically biased for (a perhaps surprising fact, given the optimistic bias of resubstitution in discrete classification), but quickly becomes optimistically biased for larger bin sizes. In the RMS column, we can see that the resubstitution CoD always beats all other estimators, especially in the case of CoD (tight regulation), which is the more surprising when we consider that the other estimators are much more computation-intensive. It is interesting to see that the leave-one-out CoD estimator beats the more complex cross-validation CoD estimator for small number of bins and large sample size. The resubstitution CoD is the least biased and least variable among all CoD estimators, across the whole range of classifier complexity and sample size considered here, and thus it also displays the best RMS overall.
In Figures 4 and 5, we examine how these performance metrics behave with varying sample sizes for , and moderate CoD and large CoD . As expected, bias (in magnitude), variance and RMS all decrease as sample size increases. We can see that the resubstitution CoD is the least biased and least variable among all estimators, and thus also displays the best RMS. The cross-validation CoD estimator is the most biased, and the leave-one-out CoD estimator is the most variable, among all CoD estimators. The bootstrap CoD estimator is less variable than the cross-validation CoD estimator.
7. Conclusion
This paper presented a comprehensive study of CoD estimators. We derived, for the first time, exact analytical expressions of performance metrics of the resubstitution and leave-one-out CoD estimators. Using a parametric Zipf model, we compared the exact performance metrics of resubstitution and leave-one-out with each other and against approximate performance metrics of the cross-validation and bootstrap CoD estimators. Our results lead to a perhaps surprising conclusion: under the Zipf model considered here, the resubstitution CoD estimator is the best-performing estimator among all, for moderate to large actual CoD and a not too large number of predictors. However, for small actual CoD values and high classifier complexity, the other three CoD estimators can outperform resubstitution. This indicates that, provided one has evidence of moderate to tight regulation between the genes and the number of predictors is not too large, one should use the CoD estimator based on resubstitution.
This work is intended to serve as a foundation for a detailed study of the application of CoD estimation in genomics and related fields. An obvious application is the inference of genomic regulatory networks from sample microarray data. In addition, there are several issues related to nonlinear prediction in the discrete domain that can benefit from the work presented here.
References
[1] Dougherty ER, Kim S, Chen Y: Coefficient of determination in nonlinear signal processing. Signal Processing 2000, 80(10):2219-2235. 10.1016/S0165-1684(00)00079-7
[2] Martins DC Jr., Braga-Neto UM, Hashimoto RF, Bittner ML, Dougherty ER: Intrinsically multivariate predictive genes. IEEE Journal on Selected Topics in Signal Processing 2008, 2(3):424-439.
[3] Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM: A general nonlinear framework for the analysis of gene interaction via multivariate expression arrays. Journal of Biomedical Optics 2000, 5(4):411-424. 10.1117/1.1289142
[4] Kim S, Dougherty ER, Chen Y, Sivakumar K, Meltzer P, Trent JM, Bittner M: Multivariate measurement of gene expression relationships. Genomics 2000, 67(2):201-209. 10.1006/geno.2000.6241
[5] Shmulevich I, Dougherty ER, Kim S, Zhang W: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18(2):261-274. 10.1093/bioinformatics/18.2.261
[6] Zhou X, Wang X, Dougherty ER: Binarization of microarray data based on a mixture model. Molecular Cancer Therapeutics 2003, 2(7):679-684.
[7] Devroye L, Gyorfi L, Lugosi G: A Probabilistic Theory of Pattern Recognition. Springer, New York, NY, USA; 1996.
[8] Braga-Neto UM: Classification and error estimation for discrete data. Current Genomics 2009, 10(7):446-462. 10.2174/138920209789208228
[9] Braga-Neto U, Dougherty E: Exact performance of error estimators for discrete classifiers. Pattern Recognition 2005, 38(11):1799-1814. 10.1016/j.patcog.2005.02.013
[10] Xu Q, Hua J, Braga-Neto U, Xiong Z, Suh E, Dougherty ER: Confidence intervals for the true classification error conditioned on the estimated error. Technology in Cancer Research and Treatment 2006, 5(6):579-589.
[11] Smith CAB: Some examples of discrimination. Annals of Eugenics 1947, 18: 272-282.
[12] Lachenbruch PA, Mickey MR: Estimation of error rates in discriminant analysis. Technometrics 1968, 10: 1-11. 10.2307/1266219
[13] Stone M: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B 1974, 36: 111-147.
[14] Efron B: Bootstrap methods: another look at the jackknife. Annals of Statistics 1979, 7: 1-26.
[15] Efron B: Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 1983, 78(382):316-331. 10.2307/2288636
[16] Zipf GK: Psycho-Biology of Languages. Houghton-Mifflin, Boston, Mass, USA; 1935.
Acknowledgments
This work was supported by the National Science Foundation, through NSF award CCF-0845407.
Appendix
Proposition 1. For a discrete random variable and disjoint events and , one has
Cite this article
Chen, T., Braga-Neto, U. Exact Performance of CoD Estimators in Discrete Prediction. EURASIP J. Adv. Signal Process. 2010, 487893 (2010). https://doi.org/10.1155/2010/487893
Keywords
- Performance Metrics
- Error Estimator
- Gene Regulatory Network
- Monte Carlo Sampling
- Zipf Model