 Research
 Open Access
Using discretization for extending the set of predictive features
Avi Rosenfeld^{1}, Ron Illuz^{1}, Dovid Gottesman^{1} and Mark Last^{2}
https://doi.org/10.1186/s13634-018-0528-x
© The Author(s) 2018
 Received: 23 May 2017
 Accepted: 4 January 2018
 Published: 18 January 2018
Abstract
To date, attribute discretization is typically performed by replacing the original set of continuous features with a transformed set of discrete ones. This paper provides support for a new idea that discretized features should often be used in addition to existing features and, as such, datasets should be extended, and not replaced, by discretization. We also claim that discretization algorithms should be developed with the explicit purpose of enriching a non-discretized dataset with discretized values. We present such an algorithm, DMIAT, a supervised algorithm that discretizes data based on minority interesting attribute thresholds. DMIAT only generates new features when strong indications exist for one of the target values needing to be learned and thus is intended to be used in addition to the original data. We present extensive empirical results demonstrating the success of using DMIAT on 28 benchmark datasets. We also demonstrate that 10 other discretization algorithms can be used to generate features that yield improved performance when used in combination with the original non-discretized data. Our results show that the best predictive performance is attained using a combination of the original dataset with added features from a “standard” supervised discretization algorithm and DMIAT.
1 Introduction
Discretization is a data preprocessing technique that transforms continuous attributes into discrete ones. This is accomplished by dividing each numeric attribute A into m discrete intervals D = {[d_{0}, d_{1}], (d_{1}, d_{2}], …, (d_{m−1}, d_{m}]}, where d_{0} is the minimal value, d_{m} is the maximal value, and d_{i} < d_{i+1} for i = 0, 1, …, m−1. The resulting values within D constitute a discretization scheme for attribute A, and P = {d_{1}, d_{2}, …, d_{m−1}} is A’s set of cut points. Traditionally, discretization has been used in place of the original values such that after preprocessing the data, D is used instead of A [1].
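To make the notation concrete, the following minimal Python sketch (ours, for illustration only; the attribute name and cut values are invented) applies a discretization scheme defined by a set of cut points P to a continuous attribute:

```python
import numpy as np

def apply_scheme(values, cut_points):
    # Map each continuous value to the index of the interval
    # [d_0, d_1], (d_1, d_2], ..., (d_{m-1}, d_m] it falls into.
    return np.searchsorted(cut_points, values, side="left")

age = np.array([3.0, 17.5, 18.0, 42.0, 65.1])  # hypothetical attribute A
cuts = [18.0, 65.0]                            # P = {d_1, d_2}, so m = 3
print(apply_scheme(age, cuts))                 # -> [0 0 0 1 2]
```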
Two basic types of discretization exist, supervised and unsupervised. Unsupervised discretization divides each A into a fixed number of intervals within D, typically through equal-width (EW) or equal-frequency (EF) heuristics [2]. Supervised discretization further considers the target class, C, in creating D. One popular supervised discretization algorithm is based on information entropy maximization (IEM), whereby the set of cut points is created to minimize the entropy within D [3]. An interesting byproduct of IEM, and of other supervised discretization methods, is that if no cut points are found according to the selection criteria, then only one bin is created for that variable, effectively eliminating A from the dataset, as D is the null set. In this way, supervised discretization can also function as a type of feature selection [4]. A large number of supervised discretization algorithms have been developed in addition to IEM, including Ameva, CAIM, CACC, Modified Chi2, HDD, IDD, and urCAIM [5–11]. We refer the reader to a recent survey [1] for a detailed comparison of these and other algorithms.
While discretization was originally performed as a necessary preprocessing step for certain supervised machine-learning algorithms that require discrete feature spaces [3], several additional advantages have since been noted. First, discretization can at times improve the performance of some classification algorithms that do not require it. Consequently, multiple studies have used D instead of A to achieve improved prediction performance [3, 12–14]. Most recently, this phenomenon has particularly been noted within medical and bioinformatic datasets [12–15]. It has been hypothesized that the source of this improvement is the attribute selection component evident within supervised discretization algorithms [13]. A second advantage of discretization is that it contributes to the interpretability of the machine-learning results, as people can better understand the connection between different ranges of values and their impact on the learned target [3, 13].
This paper’s first claim is that discretization should not necessarily be used to replace a dataset’s original values but instead to generate new features that can augment the existing dataset. As such, each discretized feature D should be used in addition to the original feature A. Using discretization in this fashion is to the best of our knowledge a completely novel idea. We claim that using datasets with both A and D can often improve the performance of a classification algorithm for a given dataset.
This claim is somewhat surprising and may not seem intuitive. It is widely established that reducing the number of features through feature selection, and by extension using discretization for feature selection, helps to improve prediction performance [16]. At the core of this claim is that the curse of dimensionality can be overcome by reducing the number of candidate predictive attributes. Counter to this claim, we posit that it is not the number of attributes that is problematic, but the lack of added value within those attributes. At times, the information gained from the discretized attributes is significant, to the point that its addition to the original attributes improves prediction accuracy.
This paper also contains a second claim, that discretization algorithms should be explicitly developed for the purpose of augmenting the original data. As support for this point, we present DMIAT, an algorithm that discretizes numeric data based on minority interesting attribute thresholds. We believe DMIAT to be unique as, to the best of our knowledge, it is the first discretization algorithm that explicitly extends a set of features through discretization. DMIAT generates features where only a minority of an attribute’s values strongly point to one of the target classes. We claim that at times it can be important to create discretized features with such indications. However, attribute selection approaches to date typically treat all values within a given attribute equally, and thus focus on the general importance of all values within a given attribute, or on combinations of the full set of different attributes’ values [17, 18]. Hence, these approaches would typically not focus on cut points based on strong indications within only a small subset of values. Once again, the potential success of DMIAT may seem counterintuitive, as it generates features in addition to the original dataset, something often believed to reduce performance [16].
To support these claims, we studied the prediction accuracy within the 28 datasets of a previous discretization study [11]. We considered the performance of seven different classification algorithms: Naive Bayes, SVM, Knn, AdaBoost, C4.5, Random Forest, and logistic regression, on five different types of datasets. First, we considered the accuracy of the original dataset (A) without any discretization. Second, we created a dataset combining the original data with the features generated by DMIAT and studied where this combination was more successful. Third, we compared the accuracy of the baseline datasets (A) to the discretized datasets (D) from the 10 algorithms previously considered [11], using the training and testing data from that paper. Somewhat surprisingly, we found that prediction accuracy on average decreased when only the discretized data was considered. Based on the understanding that discretized features can improve performance, we then created a fourth set of data appending the features generated by each of the 10 canonical discretization algorithms to the original data, creating 10 new combined datasets based on A and D. Again, we noted that the combination of the discretized features with the original data improved predictive performance. Fifth, we studied how combinations of features from different discretization algorithms can be created. Specifically, we created a dataset that combined the discretized values of DMIAT and the three discretization algorithms with the best prediction performance. The combination of the three types of features, the original data, those of the “standard” discretization algorithms, and DMIAT, performed the best.
2 The DMIAT algorithm
Unlike other discretization algorithms, DMIAT was explicitly developed to augment the original data with discretized features. It only generates such features when strong indications exist for one of the target classes, even within a relatively small subset of values within A. Our working hypothesis is that these generated features improve prediction accuracy by encapsulating signals that classification algorithms could miss in the original data or when using “standard” discretization methods.
This assumption is motivated by recent findings that, at times, values of important subsets of attributes can serve as either “triggers” or “indicators” for biological processes. Recent genomic (DNA) and transcriptomic (RNA) research has shown that some people may have a natural predisposition or immunity towards certain diseases [19]. Similarly, we posit that even small subsets of values pointing to one of the target classes can be significant, even within non-medical datasets. The success of DMIAT lies in finding these subsets; we thus shift our focus from studying all attribute values, as has been done to date by other discretization algorithms, to finding those important subsets of values within the range of a given attribute.
To make this general idea clearer, consider the following example. Assume that attribute A is a numeric value for how many cigarettes a person smokes in a given day. The dataset contains a total population of 1000 people, of whom only 10% smoked more than three cigarettes a day. Most of these heavier smokers develop cancer (say 95 of 100 people), while the cancer rates within the remaining 90% (900 people) are not elevated. Traditional discretization analyzes all values within the attribute equally and may thus ignore this relatively small, but evidently important, subset of the dataset. Methods such as IEM, which discretize based on the overall effectiveness of the discretization criterion (here, entropy reduction), will not find this attribute interesting, as this subset is not necessarily large enough to produce a significant entropy reduction within the attribute. Accordingly, IEM will not discretize this attribute, effectively removing it from the dataset. Similarly, even discretization algorithms with selection measures not based on information gain will typically ignore the significance of this subset of minority attribute values due to its size. Capturing such subsets is precisely what DMIAT was explicitly designed to do.
Specifically, we define the DMIAT algorithm as follows. We assume that n numeric attributes exist in the dataset, denoted A_{1}…A_{n}. Each attribute A_{j} has a continuous set of values, which can be sorted and then denoted val_{1}…val_{b}. There are c target values (class labels) within the target variable C, with c ≥ 2. A new discretized cut is created if a subset of a minimum size supp exists with a strong indication, conf, for one of the target values within C. As per DMIAT’s assumption, we only attempt to create up to two cuts within A_{j}: one from the minimum value val_{1} up to a cut point d_{1}, and one from a cut point d_{2} up to the maximum value val_{b}. In both cases, each subset needs to contain at least supp records.
We considered two types of “strong” indications as support, one based on entropy and one based on lift. Most similar to IEM, we considered confidence thresholds based on entropy. Entropy can be defined as \(-\sum_{i=1}^{c} p_{i} \log_{2} p_{i}\), where p_{i} here is the relative size of C_{i} within C. This value can be used as a threshold to quantify the support as the percentage of values within a given attribute that points to a given target C_{i}. For example, if the smallest 10% of values of A_{1} all point to the first target value C_{1}, then the entropy within this subset is 0 and thus constitutes a strong indication. A second type of confidence is based on lift, which is typically defined as \(\frac{p(x \mid c)}{p(x)}\), the conditional probability of x given c divided by the general probability of x. We use lift in this context as a support measure for the relative strength of a subset for a target value given that value’s general probability. For example, let us again consider a subset with the lowest 10% of values within A_{1} with respect to C_{1}. The relative strength of this subset can be defined as \(P(C_{1} \mid [val_{1} \ldots d_{1}]) / P(C_{1})\). Assume the general probability of C_{1} is 0.05, as it occurs 50 times out of 1000 records, of which 40 occurrences are within the 100 lowest values of A_{1} and only 10 within the other 900 values; then its lift will be (40/100)/(50/1000), or 8. Assuming DMIAT’s value for the conf parameter is less than 8, this subset would be considered significant and DMIAT would create a discretized cut for it.
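The following short sketch (our Python illustration, not the authors’ Matlab code) computes both confidence measures and reproduces the lift arithmetic of the example above:

```python
import numpy as np

def entropy(labels):
    # H = -sum_i p_i * log2(p_i) over the class distribution in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def lift(subset_labels, all_labels, target):
    # lift = P(target | subset) / P(target over all records)
    return float(np.mean(subset_labels == target) / np.mean(all_labels == target))

# The example from the text: C_1 occurs 50 times in 1000 records,
# 40 of them among the 100 smallest values of A_1 (taken here as the
# first 100 entries of the sorted labels, as a stand-in).
y = np.array([1] * 40 + [0] * 60 + [1] * 10 + [0] * 890)
subset = y[:100]
print(lift(subset, y, 1))           # (40/100) / (50/1000) = 8.0
print(entropy(np.array([1] * 10)))  # pure subset: entropy 0
```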
It is worth noting that the two parameters within DMIAT, supp and conf, are motivated by association rule learning [20]. In both cases, we attempt to find even relatively small subsets of values, provided they exceed a minimum amount of support, supp. Similarly, in both cases, we refer to a confidence threshold that can be defined through the absolute and relative number of instances within a subset corresponding to a target value. To the best of our knowledge, DMIAT is the first discretization algorithm to use support and confidence to find these significant subsets of attribute values.
Based on these definitions, Algorithm 1 presents DMIAT. Line 1 begins by creating a null set of discretized features D that will be extracted from the full set of attributes, A. Lines 2–4 loop through all attributes within the dataset (A), sort the continuous values within each attribute A_{j}, and consider each target value C_{i}. We then consider two discretization ranges, one beginning with the smallest value of A_{j} and one ending with the largest value in A_{j} (lines 6 and 11). The algorithm uses a binary search (BinarySearch) to find potential cut points based on the selection criterion, conf. Trivially, this step could end with bound1 being equal to the smallest value within A_{j} and bound2 being A_{j}’s largest value; typically, however, the subset will extend beyond these trivial points. Regardless, lines 7 and 12 check whether the number of records within this interval is larger than the predefined support threshold, supp. If so, a new discretized variable is created. In our implementation, the new discretized variable has a value of 1 for all original values within [A_{1}, A_{bound1}] and 0 for (A_{bound1}, A_{b}], or a value of 1 for [A_{bound2}, A_{b}] and 0 for [A_{1}, A_{bound2}). In lines 9 and 14, we add these new cuts to the generated features within D. Thus, the algorithm can potentially create two cuts for every attribute A_{j} and target value, but will typically create far fewer, given the stringent requirements typically assigned to supp and conf by association rule algorithms. Line 18 returns the new dataset combining the original attributes in A with the new discretized features from DMIAT.
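Since the pseudocode listing of Algorithm 1 is not reproduced here, the following Python sketch reconstructs the procedure from the description above. It is illustrative rather than the authors’ released implementation: the BinarySearch step is replaced by a linear scan for clarity, and the confidence test is passed in as a predicate conf_holds.

```python
import numpy as np

def dmiat(X, y, supp, conf_holds):
    """Sketch of DMIAT as described in the text (illustrative only).
    X: records-by-attributes numeric array; y: class labels;
    supp: minimum subset size; conf_holds(subset_labels, y, target):
    the entropy- or lift-based confidence test."""
    new_features = []
    for j in range(X.shape[1]):                    # each attribute A_j
        order = np.argsort(X[:, j])                # sort A_j's values
        y_sorted = y[order]
        for target in np.unique(y):                # each target value C_i
            # Largest prefix [val_1 .. d_1] passing the conf test
            # (a linear scan standing in for BinarySearch).
            for end in range(len(y_sorted), supp - 1, -1):
                if conf_holds(y_sorted[:end], y, target):
                    d1 = X[order[end - 1], j]
                    new_features.append((X[:, j] <= d1).astype(int))
                    break
            # Largest suffix [d_2 .. val_b], symmetrically.
            for start in range(len(y_sorted) - supp + 1):
                if conf_holds(y_sorted[start:], y, target):
                    d2 = X[order[start], j]
                    new_features.append((X[:, j] >= d2).astype(int))
                    break
    # Return the original attributes extended with the generated cuts.
    return np.column_stack([X] + new_features) if new_features else X
```

Passing a predicate such as `lambda s, y, t: lift(s, y, t) >= 2.0` (with lift as sketched earlier) would correspond to the Lift(2.0) setting used in the experiments below.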
The motivation for DMIAT’s attempt to create up to two cuts at the extremes, rather than searching for support within any subset as association rule theory would, is based on medical research. One motivation for DMIAT comes from the observation that the probability distribution of features relating to a medical classification problem need not be unimodal. In fact, as discussed in the previous smoker example, features are likely to be multimodal, with some modes having very low representation in the sampled dataset, yet within those modes the prevalence of one class may differ significantly from that of the remainder of the samples. Furthermore, we expect the discretized cuts at the extreme values of A_{j} to have the highest degree of interpretability, one of the main general motivations behind discretization [3, 13].
Table 1 A sample dataset with DMIAT applied
For comparison, we present in columns 6–8 the discretized cuts from the IEM algorithm for these three attributes. The IEM algorithm uses the entropy minimization heuristic to discretize a continuous value into 0, 2, or more intervals. For the Ba attribute, it minimized the overall entropy with one cut at 0.4158, and thus created two intervals, either less than or greater than this value. For the Ca attribute, four intervals were created, and for Fe no intervals were created (represented by the uniform value of “All” in this implementation). This example highlights both the similarities and differences between DMIAT and other discretization algorithms. First, DMIAT is a binary discretization method: each cut DMIAT creates yields only two intervals, one where supp and conf are met and one where they are not. In contrast, IEM and other classic discretization algorithms maximize a score over all attribute values (such as entropy reduction for IEM). Thus, these algorithms often choose different numbers of cuts (such as IEM creating four intervals for Ca) and different thresholds for these cuts (0.09 for DMIAT within Ba versus 0.4158 for IEM).
These different cut values impact the interpretability of the results. As DMIAT focuses on a subset, it directs one’s analysis to a range of values either below a given threshold (if the subset range starts at the minimum value) or above a given threshold (if the subset range ends at the maximum value). Table 1 contains both kinds of examples: the first DMIAT cut focuses one’s attention on the higher range of values (Ba > 0.09), where a strong indication exists for the target value of 7, while the second cut focuses one’s attention on the lower range of values (Ca < 8.4), where a strong indication exists for the target value of 2. Often, DMIAT and IEM agree and both find nothing of interest; note that Fe had no discretized cuts formed by either algorithm, as neither algorithm’s conditions for creating cuts were met for this attribute. As we explain below, all experiments were run with the discretization cuts generated only on the training data and then applied to the testing data.
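As a minimal illustration of this train/test protocol (with synthetic stand-in values; only the 0.09 threshold comes from Table 1), the cut is learned on the training split and reused unchanged on the test split:

```python
import numpy as np

rng = np.random.default_rng(0)
ba_train, ba_test = rng.random(90), rng.random(10)  # stand-ins for Ba

d1 = 0.09  # cut learned from the training split only (as in Table 1)
# The identical threshold is reused on the test split, so no test
# information influences where the cut is placed.
train_feature = (ba_train > d1).astype(int)
test_feature = (ba_test > d1).astype(int)
```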
3 Experimental results
We used the same 28 datasets as a recent discretization study [11]. These datasets were collected from the KEEL [21] and UCI [22] machine learning repositories and represent a variety of complexity, number of classes, number of attributes, number of instances, and imbalance ratio (the ratio of the size of the majority class to the minority class). Detailed information about the datasets can be found online^{1}. We downloaded the discretized versions of these datasets for the discretization algorithms they considered. This included the unsupervised algorithms of equal-width (EW) and equal-frequency (EF) [2], and the supervised algorithms of information entropy maximization (IEM) [3], class-attribute interdependence maximization (CAIM) [5], Ameva [6], Modified Chi2 [7], Hypercube Division-based Discretization (HDD) [8], Class-Attribute Contingency Coefficient (CACC) [9], the Interval Distance-based Method for Discretization (IDD) [10], and urCAIM [11]. Each dataset contains 10 different folds, whereby the discretized intervals were determined from the training portion in the first 90% of the file and these intervals were then applied to the remaining 10% of the data used for testing. Thus, the 28 datasets contained 10 independently constructed training and testing components, effectively creating 10-fold cross validation for a total of 280 training and testing pairs.
Our experiments were designed to answer the following research questions:

1. Do the features generated by DMIAT improve prediction performance when they are added to the original data?
2. Does the classic use of discretization, removing the continuous values, help improve performance?
3. Is it advantageous to use the discretized features in addition to the original ones?
4. Should the features generated by DMIAT be used in conjunction with other discretization algorithms?
To answer these questions, we created five different sets of data. The first set was composed of the 28 base datasets without any modification to their training and testing components (A), using the data from a previous study [11]. The second set consisted of the original data plus the discretized cuts (D) from DMIAT. The third set consisted of the discretized versions of the 28 datasets produced by the 10 above-mentioned algorithms (D), also generated based on a previous study [11]. The fourth set appended each of the discretized datasets (D) to the original datasets (A). Last, we created a fifth set by appending to the original data the features created by DMIAT and several of the best performing discretization algorithms. To facilitate replicating these experiments, we have made a Matlab version of DMIAT available at http://homedir.jct.ac.il/~rosenfa/DMIAT.zip^{2}. We ran DMIAT on a personal computer with an Intel i7 processor and 32 GB of memory. Processing the full set of 280 files with DMIAT took approximately 15 min.
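Assembling these dataset variants amounts to column-wise concatenation of the original attributes with each block of discretized features. A small sketch with hypothetical arrays:

```python
import numpy as np

def extend(original, *blocks):
    # Append discretized feature blocks (same records, new columns)
    # to the original attributes.
    return np.hstack([original] + list(blocks))

rng = np.random.default_rng(1)
X_a = rng.random((100, 8))                   # original data (A)
X_dmiat = rng.integers(0, 2, size=(100, 3))  # DMIAT cuts (binary)
X_iem = rng.integers(0, 4, size=(100, 8))    # e.g., IEM intervals (D)

set2 = extend(X_a, X_dmiat)                  # second set: A + DMIAT
set4 = extend(X_a, X_iem)                    # fourth set: A + D
set5 = extend(X_a, X_iem, X_dmiat)           # fifth set: A + D + DMIAT
```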
We then measured the predictive accuracy of seven different classification algorithms: Naive Bayes, SVM with the RBF kernel, Knn with k=3, AdaBoost, C4.5, Random Forest, and logistic regression, on each of these datasets. The first six algorithms were run in Weka [23] with the same parameters used in previous work [11]. The last algorithm was added as it is a well-accepted deterministic classification algorithm that was not present in the previous study; the default parameters of Weka’s Simple Logistic implementation were used.
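For readers who wish to approximate this evaluation outside Weka, the following scikit-learn sketch uses rough stand-ins for the seven classifiers (the parameters differ from the Weka setup, and DecisionTreeClassifier implements CART rather than true C4.5):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# scikit-learn stand-ins for the seven classifiers used in the study
classifiers = {
    "Naive Bayes": GaussianNB(),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Knn (k=3)": KNeighborsClassifier(n_neighbors=3),
    "AdaBoost": AdaBoostClassifier(),
    "C4.5-like tree": DecisionTreeClassifier(),  # CART, not true C4.5
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

X, y = load_iris(return_X_y=True)  # placeholder for a benchmark dataset
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV, as in the study
    print(f"{name}: {scores.mean():.3f}")
```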
As shown below, we found that the classic use of discretization often did not improve the average performance across these datasets. Instead, using discretization in addition to the original features typically yielded better performance, whether through using DMIAT in addition to the original data or through using discretized features from canonical algorithms in addition to the original data. The best performance was typically reached by the combined dataset of the original data enriched with DMIAT and other discretization algorithms, as posited by research question 4. Thus, we present evidence that the most effective pipeline uses DMIAT along with discretized features to extend a given set of features.
3.1 DMIAT results
The goal of the DMIAT algorithm is to supplement, not supplant, the original data. Whether additional features are generated depends on the supp and conf parameters in Algorithm 1, which is applied to every numeric attribute A_{j} within the dataset. Thus, DMIAT could potentially generate two features for every attribute based on lines 7 and 12 of Algorithm 1. In our experiments, we set supp equal to 10% of the training data. Three confidence values, conf, were checked. The first was conf=Entropy(0), meaning the cut yielded a completely decisive indication for one of the target classes and thus zero entropy, as per the first type of confidence mentioned above. We also considered two types of lift confidence: conf=Lift(1.5) and conf=Lift(2.0). For these confidence levels, we checked whether the cut yielded a sufficiently strong indication for one of the target classes as measured by its lift value. Potentially, both lift thresholds could be satisfied; for example, a cut yielding a lift of 3 would be considered significant by both thresholds, and a new feature would be generated for each. We also considered adding the cuts cumulatively, so that overlapping cuts could be added based on combinations of these three thresholds. Conversely, if none of the thresholds used were met, no cuts were created for a given attribute.
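A small sketch of how the three conf settings could be run cumulatively, assuming a DMIAT-style generator like the one sketched in Section 2 (the function name and signature here are ours):

```python
def generate_all_cuts(X_train, y_train, dmiat_features):
    """Run the three conf settings independently and pool their cuts.

    `dmiat_features(X, y, supp, conf)` is assumed to return only the
    generated binary columns for one threshold setting.
    """
    supp = max(1, int(0.10 * len(X_train)))  # supp = 10% of training data
    blocks = []
    for conf in (("entropy", 0.0), ("lift", 1.5), ("lift", 2.0)):
        blocks.append(dmiat_features(X_train, y_train, supp, conf))
    return blocks  # possibly overlapping cuts are all kept
```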
Table 2 The total number of new features generated by the three DMIAT thresholds in each of the 28 datasets

| Dataset | Attributes | Average features from DMIAT | Range |
|---|---|---|---|
| Abalone | 8 | 0 | – |
| Arrhythmia | 279 | 0 | – |
| Glass | 9 | 12.9 | 12–15 |
| Heart | 13 | 6.8 | 6–7 |
| Ionosphere | 33 | 26.1 | 22–31 |
| Iris | 4 | 20 | 19–21 |
| Jm1 | 21 | 0 | – |
| Madelon | 500 | 0 | – |
| Mc1 | 38 | 12 | 12 |
| Mfeat-factors | 216 | 0 | – |
| Mfeat-fourier | 76 | 0 | – |
| Mfeat-karhunen | 64 | 0 | – |
| Mfeat-zernike | 47 | 0 | – |
| Pc2 | 36 | 17.2 | 17–19 |
| Penbased | 16 | 0 | – |
| Pendigits | 16 | 0 | – |
| Pima | 8 | 3.1 | 3–4 |
| Satimage | 36 | 128.2 | 127–130 |
| Segment | 19 | 97.6 | 96–10 |
| Sonar | 60 | 18.2 | 15–20 |
| Spambase | 57 | 32.9 | 31–35 |
| Spectrometer | 102 | 0 | – |
| Texture | 40 | 0 | – |
| Thyroid | 21 | 1 | 1 |
| Vowel | 13 | 0 | – |
| Waveform | 40 | 39.5 | 37–42 |
| Winequality-red | 11 | 3.9 | 3–4 |
| Winequality-white | 4 | 2.7 | 2–4 |
Table 3 Comparing the accuracy of the datasets without DMIAT and with four variations of DMIAT, with parameter values of zero entropy (MIAT 0), lift of 1.5 (MIAT 1.5), lift of 2 (MIAT 2), and all three discretized values combined (MIAT 0 1.5 2)
3.2 Using discretization in addition to the original data
Table 4 Comparing the accuracy results from seven different classification algorithms on the original, baseline data (A) and the discretized data (D) formed by the Ameva, CACC, CAIM, Modified Chi2, EF, EW, HDD, IDD, IEM, and urCAIM algorithms
Table 5 Comparing the accuracy results from seven different classification algorithms on the original data (A) and the discretized data appended to the original data (A+D)
Despite the general support found for research question 3, in that using discretization to augment the original data is better than using the discretized data alone, we note that some algorithms, particularly the Knn, C4.5, and Random Forest learning algorithms, often did not benefit from any form of discretization. We further explore these differences, and more generally the impact of discretization on all algorithms, in Section 3.4.
3.3 Using DMIAT and other discretization algorithms as additional features
Table 6 Combining DMIAT with discretization features yields the highest performance, both when considering the 15 datasets where DMIAT added features and on average across all 28 datasets
As can be seen from the results on the 15 datasets (top portion of Table 6), the combination of discretization algorithms almost always outperforms the original data, and improvements in predictive accuracy are noted for all three combinations with the Naive Bayes, SVM, Knn, AdaBoost, C4.5, and logistic regression algorithms. The one exception is the Random Forest algorithm, where large performance improvements are not noted. For comparison, we also present the improvements across all 28 datasets in the bottom portion of Table 6, which include the 13 datasets where DMIAT generated no features. As expected, the improvement from the DMIAT combination was somewhat smaller there, once again demonstrating that the benefit stems from the features DMIAT adds. Overall, and on average across the algorithms we considered, we found this combination to be the most successful, with prediction accuracies typically improving by over 1%. Thus, we found research question 4 to typically be answered affirmatively: using DMIAT in conjunction with existing discretization algorithms is recommended to enrich the set of features to be considered.
3.4 Discussion and future work
We were somewhat surprised that using discretization alone was not as successful as has been previously found with classification algorithms such as Random Forest and C4.5 [3, 12–14]. We believe that differences in the datasets being analyzed are likely responsible for these gaps. As such, we believe an important open question is predicting when discretization will be successful, given the machine-learning algorithm and the input dataset. In contrast, we found that DMIAT yielded more stable results, as it typically only improved performance, something other discretization algorithms did not do, especially for these two learning algorithms. We are now exploring how to make other discretization algorithms similarly stable.
We note that the pipeline described in this paper, using DMIAT and discretized features in addition to the original data, was most effective with algorithms that lack an internal discretization component, namely the Naive Bayes, SVM, and logistic regression classification algorithms. Conversely, this approach was less effective with methods having a discretization component, particularly the C4.5, Random Forest, and AdaBoost algorithms. It has been previously noted that C4.5 has a localized discretization component (re-applied at each internal node of the decision tree) and thus may gain less from adding globally discretized features, which are split into the same intervals across all decision tree nodes [3]. Similarly, the decision stumps used as weak classifiers in AdaBoost essentially perform a global discretization, which may make AdaBoost less likely to benefit from the globally discretized features added by DMIAT, something evident from Table 3. As can be noted from Table 6, the Random Forest algorithm benefited the least from the pipeline described in this paper, again possibly due to its discretization component. We hope to further study when and why algorithms with an inherent discretization component still benefit from additional discretization. Additionally, the Knn algorithm, despite not having a discretization component, benefits less from the proposed pipeline than other algorithms. It seems plausible that this is because Knn is known to be particularly sensitive to the curse of dimensionality [24, 25], whereas the other classification algorithms are less sensitive to it. We plan to study this complex issue in our future research.
We believe that several additional directions should be pursued in future work. First, this study did not consider neural networks and deep learning, as these algorithms were not considered previously [11] and the relatively small size of several of the datasets in this study made it infeasible to obtain accurate deep learning models. We are currently considering additional datasets, particularly those with larger amounts of training data, to better understand how deep learning can be augmented by discretized features. In a similar vein, we believe that interconnections likely exist between some of the generated discretized features; multivariate feature selection and/or deep learning could potentially be used to capture these interconnections and remove redundant features. Second, we propose using metacognition, the process of learning about learning [26], to learn which discretized features should be added to a given dataset. We are also studying how one could find an optimal value, or set of values, for the conf and supp thresholds within DMIAT. While this paper demonstrates that multiple DMIAT thresholds can be used in combination, and each threshold typically improves performance, we do not claim that the thresholds used in this study are optimal for all datasets. One potential solution would be to develop a metacognition mechanism for learning these thresholds for a given dataset. Similarly, a form of metacognition could be added to machine learning algorithms, as has been generally suggested for neural networks [27], to help achieve this goal.
4 Conclusions
In this paper, we present a paradigm shift in how discretization can be used. To date, discretization has typically been used as a preprocessing step that removes the original attribute values, replacing them with discretized intervals. Instead, we suggest using the features generated by discretization to extend the dataset, keeping both the discretized and non-discretized versions of the data. We have demonstrated that discretization can often be used to generate new features that should be used in addition to the dataset’s non-discretized features. The DMIAT algorithm we present in this paper is built on this assumption: DMIAT only discretizes values with particularly strong indications, based on high confidence yet relatively low support for the target class, as it assumes that classification algorithms will also use the original data. We also show that other canonical discretization algorithms can be used in a similar fashion, and in fact a combination of the original data with DMIAT and discretized features from other algorithms yields the best performance. We are hopeful that the ideas presented in this paper will advance the use of discretization and its application to new datasets and algorithms.
Footnotes
2. The file “run_all.m” found in this zip archive was used to run DMIAT in batch across all files in a specified directory.
Declarations
Funding
The work by Avi Rosenfeld, Ron Illuz, and Dovid Gottesman was partially funded by the Charles Wolfson Charitable Trust.
Authors’ contributions
AR, RI, and DG were responsible for data collection and analysis. AR and ML were responsible for the algorithm development and writing of the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
1. Jerusalem College of Technology, Jerusalem, Israel
2. Ben-Gurion University of the Negev, Beer-Sheva, Israel
References
1. S Garcia, J Luengo, JA Sáez, V Lopez, F Herrera, A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans. Knowl. Data Eng. 25(4), 734–750 (2013).
2. MR Chmielewski, JW Grzymala-Busse, Global discretization of continuous attributes as preprocessing for machine learning. Int. J. Approx. Reason. 15(4), 319–331 (1996).
3. J Dougherty, R Kohavi, M Sahami, et al., in Machine Learning: Proceedings of the Twelfth International Conference, vol. 12. Supervised and unsupervised discretization of continuous features (Morgan Kaufmann Publishers, San Francisco, 1995), pp. 194–202.
4. H Liu, R Setiono, Feature selection via discretization. IEEE Trans. Knowl. Data Eng. 9(4), 642–645 (1997).
5. LA Kurgan, KJ Cios, CAIM discretization algorithm. IEEE Trans. Knowl. Data Eng. 16(2), 145–153 (2004).
6. L Gonzalez-Abril, FJ Cuberos, F Velasco, JA Ortega, Ameva: an autonomous discretization algorithm. Expert Syst. Appl. 36(3), 5327–5332 (2009).
7. FEH Tay, L Shen, A modified chi2 algorithm for discretization. IEEE Trans. Knowl. Data Eng. 14(3), 666–670 (2002).
8. P Yang, JS Li, YX Huang, HDD: a hypercube division-based algorithm for discretisation. Int. J. Syst. Sci. 42(4), 557–566 (2011).
9. CJ Tsai, CI Lee, WP Yang, A discretization algorithm based on class-attribute contingency coefficient. Inf. Sci. 178(3), 714–731 (2008).
10. FJ Ruiz, C Angulo, N Agell, IDD: a supervised interval distance-based method for discretization. IEEE Trans. Knowl. Data Eng. 20(9), 1230–1238 (2008).
11. A Cano, DT Nguyen, S Ventura, KJ Cios, urCAIM: improved CAIM discretization for unbalanced and balanced data. Soft Comput. 20(1), 173–188 (2016).
12. JL Lustgarten, V Gopalakrishnan, H Grover, S Visweswaran, in AMIA Annual Symposium. Improving classification performance with discretization on biomedical datasets (American Medical Informatics Association, Bethesda, 2008).
13. JL Lustgarten, S Visweswaran, V Gopalakrishnan, GF Cooper, Application of an efficient Bayesian discretization method to biomedical data. BMC Bioinformatics 12(1), 309 (2011).
14. DM Maslove, T Podchiyska, HJ Lowe, Discretization of continuous features in clinical datasets. J. Am. Med. Inform. Assoc. 20(3), 544–553 (2013).
15. A Rosenfeld, DG Graham, R Hamoudi, R Butawan, V Eneh, S Khan, H Miah, M Niranjan, LB Lovat, in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). MIAT: a novel attribute selection approach to better predict upper gastrointestinal cancer (Paris, 2015), pp. 1–7.
16. I Guyon, A Elisseeff, An introduction to variable and feature selection. J. Mach. Learn. Res. 3(Mar), 1157–1182 (2003).
17. I Guyon, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003).
18. Y Saeys, I Inza, P Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007).
19. RA Hamoudi, A Appert, et al., Differential expression of NF-kappaB target genes in MALT lymphoma with and without chromosome translocation: insights into molecular mechanism. Leukemia 24(8), 1487–1497 (2010).
20. Z Zheng, R Kohavi, L Mason, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Real world performance of association rule algorithms (ACM, New York, 2001), pp. 401–406.
21. J Alcalá-Fdez, A Fernández, J Luengo, J Derrac, S García, L Sanchez, F Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 17, 255–287 (2011).
22. M Lichman, UCI Machine Learning Repository (University of California, School of Information and Computer Science, Irvine, 2013). http://archive.ics.uci.edu/ml.
23. IH Witten, E Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn., Morgan Kaufmann Series in Data Management Systems (Elsevier, Cambridge, 2005).
24. JH Friedman, Flexible metric nearest neighbor classification. Technical report (Department of Statistics, Stanford University, 1994).
25. DW Aha, Editorial. Artif. Intell. Rev. 11, 7–10 (1997).
26. C Watkins, Learning About Learning Enhances Performance (Institute of Education, University of London, 2001).
27. R Savitha, S Suresh, N Sundararajan, Metacognitive learning in a fully complex-valued radial basis function neural network. Neural Comput. 24(5), 1297–1328 (2012).