Incremental Local Linear Fuzzy Classifier in Fisher Space

Optimizing the antecedent part of neurofuzzy systems is an active research topic, for which different approaches have been developed. However, current approaches typically suffer from high computational complexity or lack the ability to extract knowledge from a given set of training data. In this paper, we introduce a novel incremental training algorithm for the class of neurofuzzy systems that are structured based on local linear classifiers. Linear discriminant analysis is utilized to transform the data into a space in which linear discriminancy of training samples is maximized. The neurofuzzy classifier is then built in the transformed space, starting from the simplest form (a global linear classifier). If the overall performance of the classifier is not satisfactory, it is iteratively refined by incorporating additional local classifiers. In addition, rule consequent parameters are optimized using a local least squares approach. Our refinement strategy is motivated by LOLIMOT, which is a greedy partition algorithm for structure training and has been successfully applied in a number of identification problems. The proposed classifier is compared to several benchmark classifiers on a number of well-known datasets. The results demonstrate the efficacy of the proposed classifier in achieving high performance while incurring low computational effort.


Introduction
Both fuzzy logic and neural networks include approaches to human-like reasoning that utilize the human tolerance for incompleteness, uncertainty, imprecision, and fuzziness in a decision making process. Fuzzy logic is a key tool to express the knowledge of domain experts so that valuable experience of humans can be incorporated into the system design. The neural network is an information processing system with the ability to learn from training data. The learning capability of neural networks makes them an appropriate choice for combination with fuzzy systems in order to automate or support the process of developing a fuzzy system for a given task [1]. In this view, neurofuzzy systems have been introduced and widely investigated [2]. A neurofuzzy system is a fuzzy system that is trained by a learning algorithm derived from neural network theory. The learning procedure is performed by interleaving the optimization of the antecedent and consequent part parameters. The performance of a neurofuzzy system is largely influenced by structure learning, which involves two major issues: (i) parameter tuning of the antecedent part, which provides us with the fuzzy partitioning of the input space, and (ii) parameter tuning of the consequent part, in which the parameters of the consequent functions are obtained. Each subspace together with its associated consequent function is used to characterize a corresponding fuzzy rule [3]. Generally, the local models (consequent functions) are chosen to be linear, which yields local linear model structures [4].
Recently, neurofuzzy systems have found extensive applications in pattern recognition [5][6][7]. In this context, several techniques for deriving fuzzy rules from training data, such as fuzzy clustering and partitioning-based methods, have been proposed. The fuzzy clustering-based methods search the input space for clusters, which are then projected to each dimension of the input space to gain fuzzy rules with better interpretability. This approach encompasses a variety of algorithms such as the Kohonen learning rule, the hyperbox method, product-space partitioning, and the fuzzy C-means method [8]. Examples of partitioning-based methods are NEFCLASS and NEFCAR, which start with a large number of partitions. These partitions are then pruned to select the best-performing fuzzy rules [1,7,9]. For a detailed discussion on neurofuzzy rule generation algorithms, the reader is referred to [10][11][12][13][14].
This study proposes a novel incremental technique for structure optimization of local linear neurofuzzy classifiers. The proposed neurofuzzy classifier is built starting from the most generic and simplest form (a global linear classifier). If the overall performance of the classifier is not satisfactory, it is iteratively refined by incorporating additional local classifiers. The proposed refinement strategy is motivated by LOLIMOT, which is a greedy partition algorithm for structure training of local linear neurofuzzy models that determines the (sub)optimal partitioning of the input space by axis-orthogonal splits [15,16] and has found extensive applications in identification problems due to its fast implementation and high accuracy. Adapting the LOLIMOT algorithm to classification requires some unavoidable modifications. Conventional LOLIMOT is restricted to axis-orthogonal splits and is unable to handle high-dimensional data. We address these problems by employing a well-known statistical stage, namely, linear discriminant analysis (LDA). The antecedent structure of the neurofuzzy classifier is then built in the transformed (and, if needed, reduced) input space by axis-orthogonal splits. Moreover, for proper adaptation of the LOLIMOT algorithm to classification, a novel interpretation of error is introduced. Once the antecedent parameters are determined, rule consequent parameters are efficiently estimated using a local least squares approach. To assess the performance of the proposed method, results are compared with conventional classifiers (neural networks, linear Bayes, and quadratic Bayes), neurofuzzy classifiers (NEFCLASS and FuNe I), piecewise linear classifiers, and decision trees (C4.5). Experimental results on several well-known datasets demonstrate that, in most cases, our algorithm outperforms state-of-the-art classifiers and significantly improves the classification results.

The rest of this paper is organized as follows. In Section 2, local linear neurofuzzy classifiers are introduced and common approaches for antecedent and consequent parameter optimization are discussed. In Section 3, our proposed classifier is developed. Section 4 is dedicated to assessment of the proposed algorithm and the paper is concluded in Section 5.

Local Linear Neurofuzzy Classifier
A neurofuzzy system with multiple outputs can be realized either by a single SIMO or MIMO model or by a bank of SISO or MISO models [17]. In the current study, the former approach is pursued as it often requires fewer neurons [16]. Assume a set of input/label pairs $\{U, Y\}$, where $U \in \mathbb{R}^{p}$ and $Y \in \{0, 1\}^{K}$. In the case of two-class problems, it is most convenient to use the binary representation, in which there is a single target variable $y \in \{0, 1\}$ such that $y = 1$ represents the first class and $y = 0$ represents the other class. When facing a $K$-class problem, it is often convenient to use a 1-of-$K$ coding scheme in which the label $Y$ is a vector of length $K$ such that if the class is $C_j$, then all elements of $Y$ are zero except its $j$th element, denoted by $(Y)_j$, which takes the value 1. The elements of the label vector $Y$ can be interpreted as posterior probabilities of the corresponding classes, with the values of probability taking only the extreme values of 0 and 1. Therefore, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range (0, 1). This is achieved by introducing an activation function $f(\cdot)$ [18] to limit the output of the model so that it falls into (0, 1). The choice of activation function is usually the logistic sigmoid ($K = 2$ classes) or softmax ($K \geq 2$ classes). The decision is made by assigning each test sample to the class with maximum posterior probability. The network architecture of a neurofuzzy classifier, structured based on local linear classifiers, is depicted in Figure 1, where the rule antecedent inputs $Z \in \mathbb{R}^{n_z}$ and the rule consequent inputs $X \in \mathbb{R}^{n_x}$ are subsets of the input samples $U$. Each neuron $i = 1, \ldots, M$ of the model realizes a fuzzy rule:

$$R_i:\ \text{if } z_1 \text{ is } A_{i,1} \text{ and } \cdots \text{ and } z_{n_z} \text{ is } A_{i,n_z}, \text{ then } Y_i = W_i^{T} X, \tag{1}$$

where $A_{i,j}$ is the fuzzy set defined on the $j$th antecedent input in the $i$th rule and $W_i \in \mathbb{R}^{n_x \times K}$. Each neuron or rule represents $K$ local linear classifiers (LLCs) and an associated validity (weighting) function that determines the region of validity of those LLCs. For a reasonable interpretation of local classifiers, it is furthermore necessary that the validity functions sum up to one for any antecedent input $Z$. The output of the local linear neurofuzzy classifier is

$$\hat{Y} = f\left(\sum_{i=1}^{M} Y_i\, \varphi_i(Z)\right), \tag{2}$$

where $Y_i$ denotes the output of the $i$th local model and $\varphi_i(\cdot)$ is interpreted as a weighting function, $i = 1, \ldots, M$. Thus, the output of the model is obtained by applying $f(\cdot)$ to the weighted sum of the outputs of the LLCs. In other words, the model interpolates between local models by weighting functions. In the following, the validity functions $\varphi_i(\cdot)$ are chosen to be normalized Gaussians:

$$\varphi_i(Z) = \frac{\mu_i(Z)}{\sum_{j=1}^{M} \mu_j(Z)}, \qquad \mu_i(Z) = \exp\left(-\frac{1}{2}\,(Z - C_i)^{T}\, \Sigma_i^{-1}\, (Z - C_i)\right), \tag{3}$$

where $C_i \in \mathbb{R}^{n_z}$ is the center of the $i$th membership function and $\Sigma_i \in \mathbb{R}^{n_z \times n_z}$ is a diagonal matrix containing the variances of the individual dimensions, that is, $\Sigma_i = \mathrm{diag}(\sigma_{i,1}^{2}, \ldots, \sigma_{i,n_z}^{2})$. Here, we will assume the most general case for $X$ and $Z$, where $X = Z = U$. It should be pointed out that, in contrast to the models used for identification, the neurofuzzy classifier discussed here is no longer linear in the consequent parameters due to the presence of $f(\cdot)$. This leads to greater analytical and computational complexity than for identification models. However, at the expense of losing the probabilistic point of view, we can omit the nonlinear activation function as in [19]. A test sample is then assigned to the class with maximum activation value and the classifier is linear in the consequent parameters. Optimization of the rule antecedent structure and the rule consequent parameters is discussed in the following sections.
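The forward pass described above can be illustrated with a minimal NumPy sketch. It assumes the linear-activation variant (no $f(\cdot)$), diagonal covariances passed as per-dimension standard deviations, and a bias row prepended to each local parameter matrix; the function names `validity_functions` and `model_output` are hypothetical and not taken from the paper.

```python
import numpy as np

def validity_functions(Z, centers, sigmas):
    """Normalized Gaussian validity functions.

    Z: (N, n_z) antecedent inputs; centers, sigmas: (M, n_z).
    Returns an (N, M) array whose rows sum to one.
    """
    diff = Z[:, None, :] - centers[None, :, :]            # (N, M, n_z)
    log_mu = -0.5 * np.sum((diff / sigmas[None, :, :]) ** 2, axis=2)
    log_mu -= log_mu.max(axis=1, keepdims=True)            # numerical stability
    mu = np.exp(log_mu)
    return mu / mu.sum(axis=1, keepdims=True)

def model_output(X, Z, centers, sigmas, W):
    """Validity-weighted sum of local linear outputs (linear activation).

    X: (N, n_x) consequent inputs; W: (M, n_x + 1, K), bias row first
    (the bias row is an assumption of this sketch).
    """
    phi = validity_functions(Z, centers, sigmas)           # (N, M)
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])          # (N, n_x + 1)
    local = np.einsum("nd,mdk->nmk", X1, W)                # (N, M, K)
    return np.einsum("nm,nmk->nk", phi, local)             # (N, K)
```

Under these assumptions, a test sample would be assigned to the class given by `model_output(...).argmax(axis=1)`.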

Rule Consequent Parameters.
Rule consequent parameters are interpreted as parameters of local classifiers. Due to the linearity assumption for the activation function, the neurofuzzy classifier presented by (2) is linear in the consequent parameters. Therefore, these parameters can be efficiently estimated from training patterns using a least squares approach, provided that the rule antecedent structure is given. Simultaneous optimization of all consequent parameters (global optimization) yields the best results in the sense of least mean square error but involves extreme computational effort. Alternatively, we can use the local estimation approach presented in [15], which neglects the overlap between the validity functions and estimates the parameters of each rule separately. This approach is computationally more efficient than global estimation. The cost, however, is the introduction of a bias error while, on the other hand, the variance error (and the effect of over-fitting) is decreased and more robustness to noise is gained. In this paper, the local estimation approach is pursued and is described as follows. Instead of estimating all $M \times K \times (p + 1)$ consequent parameters simultaneously (as in global estimation), $M$ local estimations are carried out for the $K \times (p + 1)$ parameters of each neuron. Note that the parameter matrix associated with the $i$th LLC is $W_i \in \mathbb{R}^{(p+1) \times K}$ and that the contribution of the $i$th LLC to the output vector is $\varphi_i(U)\, W_i^{T} [1\ \ U^{T}]^{T}$. The contribution of the $i$th LLC is dominant only in the region where the associated validity function $\varphi_i(\cdot)$ is close to one (which happens near the center of $\varphi_i(\cdot)$). Training samples in this region are highly relevant for the estimation of $W_i$. Therefore, local estimation of $W_i$ can be achieved by performing the following weighted least squares optimization:

$$\min_{W_i}\ \sum_{j=1}^{N} \varphi_i(U_j)\, \left\| Y_j - W_i^{T} [1\ \ U_j^{T}]^{T} \right\|_2^{2}, \tag{4}$$

where $U_j$ denotes the $j$th input sample and $Y_j$ its label vector, $j = 1, \ldots, N$.

This optimization is equivalent to fitting a linear classifier to weighted training data. Let the target matrix $\Upsilon \in \mathbb{R}^{N \times K}$, the regression matrix $X \in \mathbb{R}^{N \times (p+1)}$, and the weighting matrix $Q_i \in \mathbb{R}^{N \times N}$ be defined as follows:

$$\Upsilon = \begin{bmatrix} Y_1^{T} \\ \vdots \\ Y_N^{T} \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & U_1^{T} \\ \vdots & \vdots \\ 1 & U_N^{T} \end{bmatrix}, \qquad Q_i = \mathrm{diag}\big(\varphi_i(U_1), \ldots, \varphi_i(U_N)\big).$$

Then, it can be simply verified that the optimum $W_i \in \mathbb{R}^{(p+1) \times K}$ that minimizes (4) is obtained as

$$W_i = \left(X^{T} Q_i X\right)^{-1} X^{T} Q_i\, \Upsilon.$$
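As an illustration of the local estimation step, the weighted least squares solution can be sketched in a few lines of NumPy. The function name `fit_local_classifier` and the optional `ridge` term are assumptions introduced here for numerical safety; they are not part of the method as described.

```python
import numpy as np

def fit_local_classifier(U, Y, phi_i, ridge=0.0):
    """Weighted least-squares estimate of one LLC's parameters.

    U: (N, p) inputs; Y: (N, K) 1-of-K targets; phi_i: (N,) validity values.
    `ridge` is an optional regularizer (an assumption, not in the text).
    Returns W_i of shape (p + 1, K).
    """
    X = np.hstack([np.ones((U.shape[0], 1)), U])    # regression matrix with bias column
    XtQ = X.T * phi_i                               # X^T Q_i without forming the N x N matrix
    A = XtQ @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, XtQ @ Y)              # solves (X^T Q_i X) W_i = X^T Q_i Y
```

Scaling the columns of `X.T` by `phi_i` avoids building the diagonal matrix explicitly, which keeps the memory cost linear in the number of samples.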

Rule Antecedent Structure.
Training of the antecedent parameters is a nonlinear optimization task, which provides us with the proper partitioning of the input space. Two common strategies for antecedent structure optimization are clustering- and partitioning-based techniques. In order to embed data-driven knowledge in a neurofuzzy system, clustering methods such as Fuzzy RuleNet [20] utilize cluster vectors extracted from the input dataset to initialize the centers of fuzzy rules. A learning algorithm is then applied to fine-tune the rules based on the available training data. These approaches usually search for hyperellipsoidal or hyperrectangular clusters in the input space and are shown to typically produce rules which are hard to interpret [21,22].
Partitioning-based methods such as NEFCLASS [9] divide the input space into finer regions by grid partitioning. Each partition is supposed to represent an if-then rule. These rules are then pruned using some heuristics. Finally, membership functions are defined using only the best-performing rules. In other words, NEFCLASS does not induce fuzzy classification rules by searching for clusters, but by modifying the fuzzy partitionings defined on each single dimension. Evidently, partitioning-based approaches are computationally expensive [23].

Proposed Algorithm for Structure Optimization
EURASIP Journal on Advances in Signal Processing
Tuning of the antecedent parameters of a neurofuzzy system is a nonlinear optimization task, which provides us with the proper input space partitioning. Some commonly used strategies for antecedent parameter optimization were discussed in Section 2.2. In addition, optimization of consequent parameters using a local least squares approach was discussed in Section 2.1. The proposed algorithm for structure optimization increases the complexity of the local linear neurofuzzy classifier during the training phase. Hence, it starts with a coarse partitioning of the input space, which is then refined by increasing the resolution of the input space partitioning. The proposed algorithm is based on the divide-and-conquer principle. This principle is widely used to attack complex problems by dividing them into simpler classification tasks whose resulting local classifiers are then combined to obtain a global classifier which also generalizes well [24].
Our strategy for input space partitioning is motivated by LOLIMOT, which is a local linear neurofuzzy algorithm that uses axis-orthogonal splits to greedily partition the input space for the rule antecedent parameter and structure training.
LOLIMOT has been successfully applied in a number of identification problems and gained significant attention due to simple and fast optimization of rule antecedent parameters. The computational complexity of LOLIMOT grows linearly with the number of neurons and cubically with the number of consequent parameters of each neuron. This level of computational complexity is quite favorable [16]. As will be discussed shortly, adapting the LOLIMOT algorithm to classification requires some unavoidable modifications. One of the most severe restrictions of LOLIMOT is the axis-orthogonal partitioning of the input space. This restriction, while being crucial for interpretation as a fuzzy system and for the development of an extremely efficient construction algorithm, leads to the following shortcomings. (i) Improper splitting of the input space, which frequently happens when the optimal partitioning of the input space does not align with axis-orthogonal directions. In such cases, the nonlinearity of the data in the original input space does not stretch along the input space axes and hence LOLIMOT cannot efficiently determine a proper input partitioning [25]. (ii) Curse of dimensionality, which often plagues fuzzy systems in real-world applications. Fuzzy methods, which are computationally manageable in low-dimensional spaces, can become completely impractical in high-dimensional spaces. Since LOLIMOT tries all divisions of the worst LLC at each iteration to decide about further refinement, the curse of dimensionality becomes even more prohibitive. Several techniques have been proposed to address these two drawbacks. For example, Nelles developed an axis-oblique decomposition algorithm, which suffers from computational concerns [25]. In addition, using different input spaces for the rule antecedent and consequent was suggested in [16], which could alleviate the computational effort. Evidently, adapting LOLIMOT to classification confronts the above shortcomings, especially when the discriminancy of classes is small in the original axes.

We suggest using a computationally cheap, easy-to-implement statistical stage, namely, LDA (also known as Fisher discriminant analysis), which alleviates the mentioned problems by rotating the original axes so that the linear discriminancy of training samples along the new axes is maximized in a global sense [8]. The basic concept of LDA is to seek the most efficient projective directions, which minimize the scattering of samples in each class and maximize the distance between different classes. In addition, LDA is capable of selecting the best linear combinations of input features for classification and hence can be used for dimensionality reduction. Therefore, axis-orthogonal partitioning of the transformed input space (building the structure in the transformed space) often significantly reduces the complexity of the antecedent structure, as well as the computational cost. LDA is formally described as follows. Consider the sample set $\{X_{i,j}\}$, where $i = 1, \ldots, I$ denotes the class to which $X_{i,j}$ belongs and $j = 1, \ldots, N_i$ denotes the index of the sample $X_{i,j}$ in the corresponding class. Now, the between-class scatter matrix $S_B$ is introduced as

$$S_B = \sum_{i=1}^{I} N_i\, (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^{T},$$

where $\bar{X}_i = (1/N_i) \sum_{j=1}^{N_i} X_{i,j}$ denotes the sample mean in class $i$ and $\bar{X} = (1/N) \sum_{i=1}^{I} \sum_{j=1}^{N_i} X_{i,j}$ is the global sample mean. Similarly, the within-class scatter matrix is defined as

$$S_W = \sum_{i=1}^{I} \sum_{j=1}^{N_i} (X_{i,j} - \bar{X}_i)(X_{i,j} - \bar{X}_i)^{T}.$$

LDA then searches for the optimal subspace projections that minimize the trace of the resulting within-class scatter matrix while maximizing the trace of the between-class scatter matrix. In other words, the selected features are the eigenvectors of $S_W^{-1} S_B$ corresponding to the largest eigenvalues. Another, perhaps more popular, unsupervised alternative to LDA is principal component analysis (PCA), which minimizes the information loss upon projection to the lower-dimensional space. Since PCA minimizes the reconstruction error rather than maximizing class separation, classification based on PCA generally achieves lower performance compared to LDA, as verified in Section 4.1. For a detailed discussion on linear dimensionality reduction techniques, the interested reader is referred to [8,26,27].
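The LDA stage can be sketched directly from the scatter-matrix definitions. The following NumPy example is a minimal illustration, not the authors' implementation: the function name `lda_projection` is hypothetical, and the use of a pseudo-inverse to guard against a singular within-class scatter matrix is an assumption of this sketch.

```python
import numpy as np

def lda_projection(X, labels, n_components):
    """Return a (p, n_components) projection matrix from the LDA eigenproblem.

    X: (N, p) samples; labels: (N,) class indices.
    """
    p = X.shape[1]
    mean_all = X.mean(axis=0)
    S_w = np.zeros((p, p))
    S_b = np.zeros((p, p))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered                 # within-class scatter
        d = (mean_c - mean_all)[:, None]
        S_b += Xc.shape[0] * (d @ d.T)               # between-class scatter
    # Eigenvectors of S_w^{-1} S_b; pinv is used in case S_w is singular (assumption).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real
```

The transformed (and possibly reduced) inputs are then obtained as `X @ lda_projection(X, labels, n_components)`, and the antecedent structure would be built on these coordinates.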
Another practical concern in adapting LOLIMOT to classification is the error interpretation. Training of local linear models in a LOLIMOT system is achieved by minimizing the local loss function defined in (4). This loss function is also used for comparison of local models. In this paper, a novel interpretation of error is introduced. While the loss function of (4) is used to train the LLCs, we suggest using a different error index for comparison of LLCs, which is based on the percentage error rather than the $\ell_2$-norm ($\|\cdot\|_2$) of the classification error. The percentage error resembles the $\ell_1$-norm of the error ($\|\cdot\|_1$), which is shown to give better classification results than the $\ell_2$-norm of the error [28]. Through our experiments, it was found that this interpretation of error improves the classification results.
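A percentage-based comparison of LLCs could be sketched as follows, using the attribution rule stated in the algorithm summary (each misclassified pattern is assigned to the LLC with the largest degree of validity). The exact normalization of the error index is not recoverable from the text, so dividing by the total number of training samples is an assumption of this sketch, as is the function name `worst_llc`.

```python
import numpy as np

def worst_llc(phi, y_true, y_pred):
    """Pick the LLC that owns the largest share of misclassified patterns.

    phi: (N, M) validity values; y_true, y_pred: (N,) class indices.
    Normalizing by N is an assumption made for this sketch.
    """
    owner = phi.argmax(axis=1)                        # LLC with the largest validity
    wrong = y_true != y_pred
    counts = np.bincount(owner[wrong], minlength=phi.shape[1])
    error_index = 100.0 * counts / len(y_true)        # percentage of all samples
    return int(error_index.argmax()), error_index
```

Because the index counts errors rather than squared residuals, a few samples with large residuals do not dominate the choice of which LLC to refine.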
Finally, note that the standard deviation of the validity functions is selected to be proportional to the extension of the corresponding hyperrectangle. In the current study, this proportionality factor is fixed and assumed to be 1/3 [16]. The proposed algorithm can be summarized as follows.
(1) Finding the most discriminative basis: apply LDA in order to find the most discriminative basis.If needed, dimension reduction is also realized in this step by keeping only the most discriminative features in the new basis.The antecedent structure is built in this transformed space.
(2) Start with an initial model: use any prior knowledge to construct the validity functions for an initial partitioning of the transformed input space. If no input space partitioning is available a priori, then set M = 1 and start with a single LLC.
(3) Compare LLCs to find the worst LLC: calculate the error index $l$ for all LLCs, in which each misclassified pattern is assigned to the LLC with the largest degree of validity. Then, the LLC with the maximum error index $l$ is selected as the worst-performing one, which is denoted by $\mathrm{LLC}_b$. In the definition of the error index, $N(\xi)$ denotes the number of elements of vector $\xi$.
(6) Test for convergence: if the termination criterion (e.g., convergence of performance) is not met, go to step 3.
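The geometric part of the refinement (halving the hyperrectangle of the worst LLC along each axis and deriving the Gaussian parameters with the 1/3 proportionality factor) can be sketched as below. The function names and the representation of a hyperrectangle by its lower and upper corners are assumptions of this illustration.

```python
import numpy as np

def split_hyperrectangle(lower, upper, dim, k_sigma=1.0 / 3.0):
    """Halve a hyperrectangle along `dim` and derive Gaussian parameters.

    lower, upper: (n_z,) corner coordinates. Returns two (center, sigma)
    pairs, with sigma proportional to the extension in each dimension.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    mid = 0.5 * (lower[dim] + upper[dim])
    halves = []
    for lo_d, hi_d in ((lower[dim], mid), (mid, upper[dim])):
        lo, hi = lower.copy(), upper.copy()
        lo[dim], hi[dim] = lo_d, hi_d
        halves.append((0.5 * (lo + hi), k_sigma * (hi - lo)))
    return halves

def candidate_splits(lower, upper):
    """One candidate axis-orthogonal split per dimension of the worst LLC."""
    return [split_hyperrectangle(lower, upper, d) for d in range(len(lower))]
```

In a full training loop, each candidate would be scored by refitting the two new LLCs and measuring the overall percentage error, and the best-scoring candidate would replace the worst LLC.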
In the next section, the efficacy of the proposed framework is experimentally studied on several datasets. In addition, to provide a better insight into the procedure, operation of the algorithm will be graphically illustrated.

Experiments
This section presents the classification results of the proposed method on several well-known datasets. The error rates of the proposed classifier are compared to those of a number of existing pattern classification algorithms. To this end, four datasets from the ELENA project [29], namely, Iris CR, Phoneme CR, Satimage CR, and Texture CR, and two datasets from the UCI machine learning repository [30], namely, Wisconsin breast cancer and Sonar, are selected. The ELENA project and the UCI machine learning repository are resources of databases for testing and benchmarking pattern classification algorithms. The main features of these datasets are summarized in Table 1.
The CR affix in the names of datasets from the ELENA project indicates that the datasets are initially preprocessed by a normalization routine to center each feature and enforce unit variance. In our experiments on these datasets, we follow a technique similar to [31]. First, each dataset is partitioned into two equal random sets: one for the training phase and the other for the test phase. Then, the roles of the two halves are reversed. To achieve more accurate results, experiments are repeated 20 times and the average error rate is reported.
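The evaluation protocol (random half/half split, role reversal, 20 repetitions) can be sketched as follows. The callback `fit_predict` is a hypothetical stand-in for any classifier, and the seeding scheme is an assumption of this sketch.

```python
import numpy as np

def two_fold_error(X, y, fit_predict, n_repeats=20, seed=0):
    """Average error rate (%) over repeated random half/half splits.

    fit_predict(X_train, y_train, X_test) -> predicted labels; it stands in
    for any classifier and is a hypothetical callback.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        a, b = perm[: n // 2], perm[n // 2:]
        for train, test in ((a, b), (b, a)):          # swap the roles of the halves
            pred = fit_predict(X[train], y[train], X[test])
            errors.append(np.mean(pred != y[test]))
    return 100.0 * float(np.mean(errors))
```

Each repetition contributes two error estimates, so the reported figure is an average over 40 train/test runs under these assumptions.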

Role of LDA in Preprocessing.
In this subsection, the role of LDA in the preprocessing phase is experimentally studied. Our first experiment is conducted on the Iris CR dataset. To be able to visualize the results, PCA [8] is first applied to reduce the number of features to two. Then, the proposed algorithm is applied to obtain the partitioning of the input space for all samples in the Iris CR dataset. Using rectangles to show the validity regions of the corresponding LLCs, Figures 2(a) and 2(b) depict the obtained partitioning without and with LDA preprocessing, respectively. Therefore, this illustration provides a comparison between PCA and LDA in the preprocessing phase. It is observed that there exists a partition in Figure 2(a) for which it is impossible to find a linear classifier that correctly classifies all samples in the corresponding partition, whereas this situation does not occur in Figure 2(b), in which splitting directions are not axis-orthogonal but are selected to maximize the linear discriminancy of the samples. This illustrative example implies that, with the same number of partitions, LDA generally provides a better partitioning of the input space compared to PCA. Figure 2(b) also provides a valuable insight into the process of input space partitioning by the proposed algorithm.
As our second experiment, the advantage of using LDA preprocessing is quantified in terms of the performance of the proposed classifier on a number of datasets, namely, Iris CR, Satimage CR, Texture CR, and Phoneme CR. Table 2 lists average classification error rates of the proposed classifier, using PCA or LDA in the preprocessing phase. In accordance with our expectation, using LDA leads to better classification performance on all datasets.

Comparison with Conventional Classifiers.
Table 3 lists the error rates of the proposed classifier and of some conventional classifiers, namely, neural network, linear Bayes, and quadratic Bayes, on several datasets, as reported in [31]. It shall be noted that, in [31], each classifier has been reasonably optimized with regard to parameter settings and available features. In addition, an earnest effort was made to optimize each individual classifier with respect to selecting good values for the parameters which govern its performance. Moreover, feature selection techniques have been applied to feed each classifier with the best features.
Table 3 indicates that, on both the Iris CR and Satimage CR datasets, the proposed technique outperforms the other classifiers. On the Texture CR dataset, our classifier outperforms the neural network classifier, achieves results comparable to the linear Bayes classifier, and is slightly worse than the quadratic Bayes classifier. On the other hand, on the Phoneme CR dataset, our classifier outperforms both the linear and quadratic Bayes classifiers and achieves results worse than the neural network classifier. These results suggest that the proposed simple local linear fuzzy classifier could be quite successful compared to these conventional classifiers. Finally, note that the proposed classifier typically achieves better results in comparison with the neural network classifier, of which it can be regarded as a close relative.

Comparison with Other Neurofuzzy Classifiers.
NEFCLASS and FuNe I are two well-known neurofuzzy classifiers. NEFCLASS starts with a large number of partitions in the input space, which, as mentioned in Section 2.2, are pruned to select the best-performing fuzzy rules [9]. FuNe I, on the other hand, has a five-layer feedforward structure and restricts itself to rules with one or two antecedents. Fuzzy rules are then learned by a special training network that helps to identify suitable combinations of one or two variables as antecedents. These rules are then trained to find suitable fuzzy sets for the rules [34,35]. The performance of these classifiers and of the proposed method is compared on the Iris CR dataset. For NEFCLASS and FuNe I, the number of rules was limited to ten and seven, respectively. Reported error rates of NEFCLASS and FuNe I are 3.33% and 4%, respectively, while our method achieves an error rate of 2.33% with seven partitions. It shall be pointed out that, in contrast to the proposed classifier, NEFCLASS and FuNe I suffer from high computational complexity [35].

Comparison with Piecewise Linear Classifiers.
Piecewise linear classifiers approximate complex decision boundaries with piecewise linear functions. Recently, Kostin presented a simple and fast piecewise linear classifier which demonstrated comparable (even superior in many cases) results with many well-known benchmark classifiers [32]. Kostin's classifier is based on a simple calculation of the centroids of classes and the creation of a binary partition tree of class centroids, which is then used to sequentially construct piecewise linear boundaries for each nonleaf node of the partition tree [32]. As is the case in our classifier, the complexity of the classifier is sequentially increased until satisfactory performance is achieved.
In this subsection, due to its similar essence and properties, the performance of Kostin's classifier [32], as a representative member of piecewise linear classifiers, is compared with the proposed classifier. The average classification error rates for the two methods are listed in Table 4. As indicated by the results, the proposed classifier achieves a better performance compared to Kostin's classifier on five datasets, with slightly worse performance on the Phoneme CR dataset. This improvement is intuitively explained by noting that, with the same complexity, natural datasets are generally better expressed by space grids than by hyperplanes. Furthermore, the fuzzy nature of the decision making process in the proposed classifier may be regarded as an advantage over the crisp decision boundaries involved in [32].

Comparison with Decision Tree Classifiers.
Decision tree algorithms are regarded as a powerful classification tool in the machine learning community and have proved quite influential in practice [33]. These classifiers are constructed in the form of a decision tree, in which each nonleaf node tests a function of some attributes. An unknown pattern is then classified by making consecutive decisions starting from the root until reaching a leaf node. Clearly, proper selection of the test functions and associated attributes at each node is vital for successful application of decision tree classifiers. Among several choices, C4.5 is utilized in our experiments as a successful and popular decision tree classifier [33]. Using two-way splits for numeric attributes in the creation of the decision tree, C4.5 examines a family of possible tests at each node and selects the one which maximizes the value of some splitting criterion. Once the tree is built, a pruning procedure is performed to avoid overfitting and excessive complexity.
Due to their similar essence and characteristics, a comparison of the proposed classifier and decision tree classifiers is natural. Therefore, in this subsection, the performance of the proposed classifier is compared with C4.5, as a representative member of the decision tree classifiers. Table 4 lists average classification error rates of the proposed method as well as of the C4.5 classifier. As indicated by Table 4, except for the Phoneme CR dataset, the proposed classifier outperforms C4.5.

Conclusion
In this study, a simple and computationally efficient local linear neurofuzzy classifier has been introduced, implemented, and tested on a number of well-known datasets. The structure of the antecedent part is obtained during the training phase and is data-driven rather than knowledge-based. The input space is first transformed by LDA, so that the linear discriminancy of training samples is maximized. The antecedent structure is then built in the transformed space by axis-orthogonal splits. At each iteration, the local linear classifier with the worst error index is split into two new rules, which are then included in the classifier. In addition, the rule consequent parameters are optimized using a local least squares approach. Simplicity and speed are the main advantages of the proposed classifier. Together with its high performance, this makes the classifier a good choice for many applications in which the use of more sophisticated classifiers can be impractical.

(4) Check all divisions of worst LLC: consider the $\mathrm{LLC}_b$ for further refinement. The hyperrectangle of this LLC is partitioned into two halves with an axis-orthogonal split. Divisions in all dimensions are considered. For each of the p divisions, the following steps are taken. (a) Construct the membership functions for both hyperrectangles. (b) Construct all validity functions. (c) For both newly generated LLCs, weigh the training samples with the corresponding validity functions and fit a linear classifier to these weighted samples by minimizing the local loss function defined in (4) (local optimization of the rule consequent parameters for both newly generated LLCs). (d) Calculate the percentage error of classification for the current overall model.
(5) Find the best division: select the best of the p alternatives checked in step 4. The validity functions constructed in step 4a and the LLCs optimized in step 4c are included in the classifier. The number of LLCs is increased by one.

Figure 2: Application of the proposed algorithm for partitioning of the input space in the Iris CR dataset, (a) without and (b) with LDA in the preprocessing phase.

Table 1: Main features of datasets used in our experiments.

Table 2: Error rates in percentage for the proposed classifier using PCA or LDA preprocessing. The best results are highlighted in boldface.

Table 3: Error rates in percentage for conventional and proposed classifiers on several datasets. The best results are highlighted in boldface.

Table 4: Error rates for piecewise linear classifier [32], C4.5 decision tree [33], and the proposed classifier on several datasets. The best results are highlighted in boldface.