A Novel Prostate Cancer Classification Technique Using Intermediate Memory Tabu Search

The introduction of multispectral imaging in pathology problems such as the identification of prostatic cancer is recent. Unlike conventional RGB color space, it allows the acquisition of a large number of spectral bands within the visible spectrum. This results in a feature vector of size greater than 100. For such a high dimensionality, pattern recognition techniques suffer from the well-known curse of dimensionality problem. The two well-known techniques to solve this problem are feature extraction and feature selection. In this paper, a novel feature selection technique using tabu search with an intermediate-term memory is proposed. The cost of a feature subset is measured by the leave-one-out correct-classification rate of a nearest-neighbor (1-NN) classifier. The experiments have been carried out on the prostate cancer textured multispectral images and the results have been compared with a reported classical feature extraction technique. The results have indicated a significant boost in the performance both in terms of minimizing features and maximizing classification accuracy.


INTRODUCTION
Prostate cancer has become the second most commonly diagnosed cancer in the male population after lung cancer, with approximately 22 800 new cases diagnosed every year in the UK alone. Currently, prostate needle biopsy remains the only conclusive way to make an accurate diagnosis of prostate cancer [1]. Recently Roula et al. have described a novel approach in which additional spectral data is used for the classification of prostate needle biopsies [2,3]. The aim of this novel approach is to help pathologists reduce the diagnosis error rate. Instead of analyzing conventional grey scale or RGB color images, spectral bands have been used in the analysis. Results have shown that the multispectral image classification outperforms both RGB and grey-level-based classification. The following four classes have been discriminated: stroma, benign prostatic hyperplasia (BPH), prostatic intraepithelial neoplasia (PIN), and prostatic carcinoma (PCa).
Figure 1 shows samples of the four classes.
In their research [2,3], the total number of features used is greater than 100. For such a high-dimensionality problem, pattern recognition techniques suffer from the well-known curse-of-dimensionality problem [4], as the number of training samples used to design the classifier is small relative to the number of features. One way to overcome the curse-of-dimensionality problem is to reduce the dimensionality of the feature space. Principal component analysis (a well-known feature extraction method) has been used by Roula et al. on the large resulting feature vectors to reduce their dimensionality to a manageable size. The classification tests have been carried out using a supervised classical discrimination method [5]. Although Roula et al. have achieved a classification accuracy of over 94% in their experiments, this is still unacceptable in medical applications. This work attempts to further improve this classification accuracy.
The other way to reduce the dimensionality of the feature space is by using feature selection methods. The term feature selection refers to algorithms that select the best subset of the input feature set. These algorithms, used in the design of pattern classifiers, have three goals: (1) to reduce the cost of extracting features, (2) to improve the classification accuracy, and (3) to improve the reliability of the estimation of performance [4,6]. Feature selection leads to savings in measuring features (since some of the features are discarded), and the selected features retain their original physical interpretation [4]. The feature selection problem can be viewed as a multiobjective optimization problem since it involves minimizing the size of the feature subset and maximizing the classification accuracy.
Mathematically, the feature selection problem can be formulated as follows. Suppose $X$ is the original feature vector with cardinality $n$ and $\bar{X}$ is the new feature vector with cardinality $\bar{n}$, $\bar{X} \subseteq X$, and $J(\bar{X})$ is the selection criterion function for the new feature vector $\bar{X}$. The goal is to optimize $J(\cdot)$.
This feature selection problem is an NP-hard problem. Therefore, the optimal solution cannot be guaranteed to be found except by performing an exhaustive search in the solution space [7]. However, exhaustive search is feasible only for small n. Different algorithms have been proposed for feature selection to obtain near-optimal solutions [4,6,8,9,10,11]. The choice of an algorithm for selecting the features from an initial set depends on n. The feature selection problem is of small scale, medium scale, or large scale if n belongs to [0, 19], [20, 49], or [50, ∞), respectively [6,10]. Sequential algorithms such as sequential forward floating search (SFFS) and sequential backward floating search (SBFS) are efficient and usually find fairly good solutions for small- and medium-scale problems [9]. However, they tend to become trapped in local optima for large-scale problems when n > 100 [6,10]. Modern iterative heuristics such as tabu search and genetic algorithms have been found effective in tackling this category of problems, which have an exponential and noisy search space with numerous local optima [10,11,16].
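To make the infeasibility of exhaustive search concrete, the short sketch below counts the nonempty feature subsets for the scale boundaries quoted above and for a 128-feature vector. The function name `nonempty_subsets` is illustrative only.

```python
def nonempty_subsets(n):
    """Number of nonempty subsets of n features: every feature is either in or out."""
    return 2 ** n - 1


# Scale boundaries from the text (19, 49) and a 128-feature vector.
for n in (19, 49, 128):
    print(f"n = {n:3d}: {nonempty_subsets(n)} candidate subsets")
```

Each candidate subset would also require a full classifier evaluation, which is why the count alone rules out enumeration for n above a few tens.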
Recently Zhang et al. have used tabu search with short-term memory to solve the optimal feature selection problem. The experimental results on synthetic data have shown that tabu search not only has a high possibility of obtaining the optimal or a near-optimal solution, but also requires less computational time than the other suboptimal and genetic algorithm methods [10].
The aim of this paper is to reduce the classification error rate of prostate needle biopsies for the four groups mentioned above while removing as many features as possible from the original feature vector. The tabu search method is used to find a feature subset. A novel approach based on intermediate-term memory has been developed to improve the search through tabu search. Due to the multiobjective nature of the problem and the various cost parameters, that is, minimizing the number of features, maximizing the classification accuracy, and so forth, it is not clear what constitutes the best solution. Therefore, the objective function is measured by using a fuzzy formula, since fuzzy logic provides a suitable mathematical framework to address such a problem. Classification accuracy is measured by the leave-one-out correct-classification rate of a nearest-neighbor (1-NN) classifier. Neighbors are calculated using the Euclidean distance. The 1-NN classifier is simple and provides a reasonable classification performance in most applications. The most straightforward 1-NN rule can be conveniently used as a benchmark for all other classifiers. Further, as the 1-NN classifier does not require any user-specified parameters, its classification results are implementation independent [4].
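The leave-one-out 1-NN evaluation described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation; the function name `loo_1nn_error`, the 0/1 `mask` argument, and the toy data are assumptions made for the example.

```python
import math


def loo_1nn_error(patterns, labels, mask):
    """Leave-one-out errors of a 1-NN classifier restricted to features with mask[j] == 1.

    Returns (number of incorrect predictions, error rate).
    """
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        raise ValueError("at least one feature must be selected")
    errors = 0
    for i, x in enumerate(patterns):
        best_dist, best_label = math.inf, None
        for k, y in enumerate(patterns):
            if k == i:
                continue  # the pattern under test is left out of the reference set
            # Euclidean distance over the selected features only
            dist = math.sqrt(sum((x[j] - y[j]) ** 2 for j in selected))
            if dist < best_dist:
                best_dist, best_label = dist, labels[k]
        if best_label != labels[i]:
            errors += 1
    return errors, errors / len(patterns)


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is misleading.
patterns = [[1.0, 5.0], [1.2, 9.0], [3.0, 5.1], [3.1, 9.2]]
labels = ["a", "a", "b", "b"]
print(loo_1nn_error(patterns, labels, [1, 0]))  # feature 0 only
print(loo_1nn_error(patterns, labels, [1, 1]))  # both features
```

On this toy set, dropping the misleading feature removes every error, which is the effect feature selection aims for.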
The rest of the paper is organized as follows. Section 2 gives an overview of tabu search and fuzzy logic, followed by the proposed tabu search design using intermediate-term memory for the feature selection problem. The experiments and the comparison with other reported classifiers are presented in Section 4. Section 5 concludes the paper.

Tabu search
Tabu search (TS) was introduced by Glover [17,18] as a general iterative metaheuristic for solving combinatorial optimization problems. Tabu search is conceptually simple and elegant. It is a form of local neighborhood search. Each solution $S \in \Omega$ has an associated set of neighbors $N(S) \subseteq \Omega$, where $\Omega$ is the set of feasible solutions. A solution $S' \in N(S)$ can be reached from $S$ by an operation called a move to $S'$. TS moves from a solution to its best admissible neighbor, even if this causes the objective function to deteriorate. To avoid cycling, solutions that have been recently explored are declared forbidden or tabu for a number of iterations. The tabu status of a solution is overridden when certain criteria (aspiration criteria) are satisfied. Sometimes, intensification and diversification strategies are used to improve the search. In the first case, the search is accentuated in the promising regions of the feasible domain. In the second case, an attempt is made to consider solutions in a broad area of the search space. The tabu search algorithm is given in Algorithm 1, where (i) $\Omega$ is the set of feasible solutions, (ii) $S$ is the current solution, (iii) $S^*$ is the best admissible solution, (iv) Cost is the objective function, (v) $N(S)$ is the neighborhood of solution $S$, (vi) $V^*$ is the sample of neighborhood solutions, (vii) $T$ is the tabu list, (viii) AL is the aspiration level.
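The general scheme described above can be sketched in a few lines. This is a generic illustration of short-term-memory tabu search under simple assumptions (solutions are hashable, the tabu list stores recently visited solutions, and the aspiration level is the best cost found so far); it is not a transcription of Algorithm 1, and the names `tabu_search`, `neighbors`, and `cost` are hypothetical.

```python
from collections import deque


def tabu_search(initial, neighbors, cost, tabu_size, iterations):
    """Generic tabu search that minimizes `cost` starting from `initial`."""
    current = best = initial
    best_cost = cost(initial)
    tabu = deque(maxlen=tabu_size)  # recently visited solutions
    for _ in range(iterations):
        candidates = []
        for s in neighbors(current):
            c = cost(s)
            # Aspiration: a tabu solution is admissible if it beats the best so far.
            if s not in tabu or c < best_cost:
                candidates.append((c, s))
        if not candidates:
            break
        # Move to the best admissible neighbor, even if it is worse than `current`.
        c, current = min(candidates, key=lambda pair: pair[0])
        tabu.append(current)
        if c < best_cost:
            best, best_cost = current, c
    return best, best_cost


# Toy usage: minimize (x - 7)^2 over the integers with moves x - 1 and x + 1.
result = tabu_search(0, lambda x: [x - 1, x + 1], lambda x: (x - 7) ** 2, 3, 20)
print(result)
```

In the toy run the search walks past the optimum after reaching it, because returning is tabu; the best solution is nevertheless retained, which is the behavior the text describes.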

Fuzzy logic
Fuzzy set theory has been recently applied in many areas of science and engineering [20]. In most practical situations, one is faced with several concurrent objectives. Classic approaches usually deal with such difficulty by computing a single utility function as a weighted sum of the individual objectives, where more important objectives are assigned higher weights. Balancing different objectives by weight functions is at best controversial. Fuzzy logic is a convenient vehicle for trading off different objectives. It allows the values of different criteria to be mapped into linguistic values, which characterize the level of satisfaction of the designer with the numerical value of the objectives, and allows operation over the interval [0, 1] defined by the membership functions for each objective.

Fuzzy objective function
In this paper, we present a tabu search algorithm where the quality of a solution is characterized by a fuzzy logic rule expressed in linguistic variables of the problem domain. Three linguistic variables are defined to correspond to the three component objective functions: number of features $f_1$, number of incorrect predictions $f_2$, and average classification error rate $f_3$. One linguistic value is defined for each component of the objective function. These linguistic values characterize the degree of satisfaction of the designer with the values of the objectives $f_i(x)$, $i \in \{1, 2, 3\}$. These degrees of satisfaction are described by the membership functions $\mu_i(x)$ on fuzzy sets of the linguistic values. The membership functions for the minimum number of features, the minimum number of incorrect predictions, and the low classification error rate are easy to build. They are assumed to be nonincreasing functions because the smaller the number of features $f_1(x)$, the number of incorrect predictions $f_2(x)$, and the classification error rate $f_3(x)$, the higher the degrees of satisfaction $\mu_1(x)$, $\mu_2(x)$, and $\mu_3(x)$ of the expert system (see Figure 2). The fuzzy subset of a good solution is defined by the following fuzzy logic rule: "if a solution has a small number of features and a small number of incorrect predictions and a low classification error rate, then it is a good solution." According to the and-like/or-like ordered weighted averaging logic [21,22], the above rule evaluates to (1), where $\gamma$ is a constant in the range [0, 1]. The shape of the membership function $\mu(x)$ is shown in Figure 2.
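As a point of reference, the and-like ordered weighted averaging operator is commonly written in the form below for three membership values. It is given here as an assumed standard form of (1), not as a confirmed transcription of the authors' expression; $\gamma$ weights the pure "and" (the minimum) against the arithmetic mean.

```latex
\mu(x) = \gamma \cdot \min\bigl(\mu_1(x), \mu_2(x), \mu_3(x)\bigr)
       + (1 - \gamma) \cdot \frac{1}{3} \sum_{i=1}^{3} \mu_i(x)
```

With $\gamma = 1$ the rule reduces to a strict conjunction, and with $\gamma = 0$ it reduces to a plain average of the three degrees of satisfaction.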
Membership of data in a fuzzy set is defined using values in the range [0, 1]. The membership values for the number of features $F$, the number of incorrect predictions $P$, and the classification error rate $E$ are computed by using (2), (3), and (4), respectively. The maximum number of features ($F_{\max}$) is the size of the feature vector and the minimum number of features ($F_{\min}$) is 1. The maximum number of incorrect predictions ($P_{\max}$) and the maximum classification error rate ($E_{\max}$) are determined by applying the 1-NN classifier to the initial solution. The minimum number of incorrect predictions ($P_{\min}$) is 0, while the minimum classification error rate ($E_{\min}$) is 0%.
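The following sketch shows one way the membership values and their aggregation could be computed. It assumes a piecewise-linear nonincreasing membership between the minimum and maximum bounds and the standard and-like OWA aggregation; both are assumptions, since the exact forms of (1) to (4) are not reproduced here. The names `membership` and `fuzzy_quality` and the default `gamma=0.5` are hypothetical.

```python
def membership(value, v_min, v_max):
    """Nonincreasing membership: 1 at or below v_min, 0 at or above v_max, linear between."""
    if value <= v_min:
        return 1.0
    if value >= v_max:
        return 0.0
    return (v_max - value) / (v_max - v_min)


def fuzzy_quality(f, p, e, f_max, p_max, e_max,
                  gamma=0.5, f_min=1, p_min=0, e_min=0.0):
    """And-like OWA aggregation of the three memberships (assumed form)."""
    mus = [
        membership(f, f_min, f_max),  # number of features
        membership(p, p_min, p_max),  # number of incorrect predictions
        membership(e, e_min, e_max),  # classification error rate
    ]
    return gamma * min(mus) + (1 - gamma) * sum(mus) / len(mus)


# Hypothetical values: 51 of 101 features, 10 of at most 40 errors, 5% of at most 20%.
print(fuzzy_quality(51, 10, 5.0, f_max=101, p_max=40, e_max=20.0))
```

Under this convention a larger value means a more satisfactory solution; whether the search maximizes this value or minimizes a complementary cost depends on the convention of (1), which this sketch does not settle.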

Initial solution
The feature selection vector is represented by a 0/1 bit string, where 0 indicates that the feature is not included in the solution and 1 indicates that it is included. All features are included in the initial solution.

Neighborhood solutions
Neighbors are generated by randomly adding or deleting a feature from the feature vector of size n. For example, if 11001 is the current feature vector, then the possible neighbors with a candidate list size of 3 can be 10001, 11101, and 01001. Among the neighbors, the one with the best cost (i.e., the solution which results in the minimum value of (1)) is selected and considered as the new current solution for the next iteration.
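Neighbor generation as described above amounts to flipping one randomly chosen bit per candidate. The sketch below is one possible reading; the function name `generate_neighbors`, the string representation, and the rule of skipping an empty feature subset are assumptions made for the example.

```python
import random


def generate_neighbors(current, candidate_size, rng=random):
    """Return up to `candidate_size` (position, neighbor) pairs.

    Each neighbor differs from `current` in exactly one bit, i.e. one feature
    is added (0 -> 1) or deleted (1 -> 0).
    """
    n = len(current)
    positions = rng.sample(range(n), min(candidate_size, n))
    neighbors = []
    for pos in positions:
        flipped_bit = "1" if current[pos] == "0" else "0"
        neighbor = current[:pos] + flipped_bit + current[pos + 1:]
        if "1" not in neighbor:
            continue  # assumption: a solution must keep at least one feature
        neighbors.append((pos, neighbor))
    return neighbors


# Three random neighbors of the example vector from the text.
for pos, neighbor in generate_neighbors("11001", 3, random.Random(0)):
    print(pos, neighbor)
```

Returning the flipped position together with the neighbor makes it straightforward to record the move in a tabu list.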

Tabu moves
A tabu list is maintained to avoid returning to previously visited solutions. With this approach, if a feature (move) is added or deleted at iteration i, then adding or deleting the same feature (move) for T iterations (tabu list size) is tabu.
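A feature-based tabu list with tenure T can be kept as a map from feature index to the iteration at which the feature becomes free again. The class below is a minimal sketch under that assumption; the name `TabuList` and its methods are hypothetical.

```python
class TabuList:
    """Feature-based tabu list: a feature moved at iteration i is tabu for iterations i+1 .. i+T."""

    def __init__(self, tenure):
        self.tenure = tenure
        self.free_at = {}  # feature index -> first iteration at which it may move again

    def add(self, feature, iteration):
        """Record that `feature` was added or deleted at `iteration`."""
        self.free_at[feature] = iteration + self.tenure + 1

    def is_tabu(self, feature, iteration):
        """True if moving `feature` at `iteration` is currently forbidden."""
        return iteration < self.free_at.get(feature, 0)


tabu = TabuList(tenure=3)
tabu.add(feature=5, iteration=10)
print([tabu.is_tabu(5, it) for it in (11, 12, 13, 14)])
```

Storing expiry iterations avoids having to age or trim a queue on every iteration.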

Aspiration criterion
The aspiration criterion is a mechanism used to override the tabu status of moves. It temporarily overrides the tabu status if the move is sufficiently good. In our approach, if a feature is added or deleted at iteration i and this move results in the best cost over all previous iterations, then this feature is allowed to be added or deleted even if it is in the tabu list.
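The interaction between tabu status and aspiration can be sketched as a filter over the candidate moves. The function name `select_move` and the representation of candidates as (feature, cost) pairs are assumptions, and lower cost is taken to be better.

```python
def select_move(candidates, tabu_features, best_cost):
    """Pick the lowest-cost admissible move from (feature, cost) pairs.

    A move is admissible if its feature is not tabu, or if it is tabu but
    its cost improves on the best cost seen so far (aspiration).
    Returns None when no move is admissible.
    """
    admissible = [
        (feature, cost)
        for feature, cost in candidates
        if feature not in tabu_features or cost < best_cost
    ]
    if not admissible:
        return None
    return min(admissible, key=lambda move: move[1])


candidates = [(0, 0.30), (1, 0.10), (2, 0.25)]
# Feature 1 is tabu but beats the best cost of 0.20, so aspiration admits it.
print(select_move(candidates, {1, 2}, best_cost=0.20))
```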

Termination rule
The most commonly used stopping criteria in TS are (i) stopping after a fixed number of iterations, (ii) stopping after some number of iterations without an improvement in the objective function value, and (iii) stopping when the objective reaches a prespecified value.
In our algorithm, the termination condition is a predefined number of iterations.

Intensification of the search
For intensification, the search is accentuated in the promising regions of the feasible domain. Intensification is based on some intermediate-term memory. Since the solution space is extremely large (initial feature vector n > 100), it is important to intensify the search in the promising regions by removing poor features from the search space. The following steps are proposed to intensify the search.
Step 1. Store the M best solutions in intermediate memory for $T_1$ iterations.
Step 2. Remove features that are included fewer than N times in the M best solutions.
Step 3. Rerun the tabu search with the reduced set of features for another $T_2$ iterations.
Step 4. Repeat Steps 1, 2, and 3 until the optimal or a near-optimal solution is achieved.
The values of M and N can be determined empirically through experiments. As an example, assume that the four best solutions shown in Figure 3 are found by tabu search during $T_1$ iterations. Features $f_1$ and $f_3$ are always used, while feature $f_5$ is never used in good solutions. For N = 2, the reduced feature set consists of only $f_1$, $f_2$, $f_3$, and $f_6$. Thus, tabu search will search for near-optimal solutions in a reduced search space and avoid visiting nonpromising regions.
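Step 2 of the intensification procedure can be sketched as a simple occurrence count over the stored best solutions. The four bit strings below are hypothetical and were chosen only to be consistent with the example in the text ($f_1$ and $f_3$ always used, $f_5$ never used, reduced set $f_1$, $f_2$, $f_3$, $f_6$ for N = 2); they are not the contents of Figure 3. The function name `reduce_feature_set` is also an assumption.

```python
def reduce_feature_set(best_solutions, min_count):
    """Count how often each feature occurs in the best solutions and keep
    the indices of features occurring at least `min_count` times."""
    n = len(best_solutions[0])
    counts = [sum(int(sol[j]) for sol in best_solutions) for j in range(n)]
    kept = [j for j, count in enumerate(counts) if count >= min_count]
    return counts, kept


# Hypothetical M = 4 best solutions over features f1..f6 (leftmost bit is f1).
best_solutions = ["111001", "101101", "111000", "101001"]
counts, kept = reduce_feature_set(best_solutions, min_count=2)
print("occurrences:", counts)
print("kept features:", [f"f{j + 1}" for j in kept])
```

The tabu search is then rerun on the kept indices only, so every later neighbor is drawn from the reduced space.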

EXPERIMENTS AND DISCUSSION
The tabu search in combination with the 1-NN classifier is tested on two data sets reported in [2,3]. In order to offset any bias due to the different ranges of values for the original features, the input feature values are normalized over the range [1, 11] by using (5) [23]. Normalizing the data is important to ensure that the distance measure allocates equal weight to each variable. Without normalization, the variable with the largest scale will dominate the measure:

$$x'_{i,j} = \frac{x_{i,j} - \min_{k=1,\dots,n} x_{k,j}}{\max_{k=1,\dots,n} x_{k,j} - \min_{k=1,\dots,n} x_{k,j}} \times 10 + 1, \qquad (5)$$

where $x_{i,j}$ is the jth feature of the ith pattern, $x'_{i,j}$ is the corresponding normalized feature, and $n$ is the total number of patterns.
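The normalization in (5) can be sketched in pure Python as follows. The function name `normalize` and the handling of a constant feature (mapped to the lower bound) are assumptions; the text does not say how a zero range is treated.

```python
def normalize(patterns, low=1.0, span=10.0):
    """Min-max normalize each feature column to the range [low, low + span]."""
    n_features = len(patterns[0])
    mins = [min(p[j] for p in patterns) for j in range(n_features)]
    maxs = [max(p[j] for p in patterns) for j in range(n_features)]
    normalized = []
    for p in patterns:
        row = []
        for j in range(n_features):
            value_range = maxs[j] - mins[j]
            if value_range == 0:
                row.append(low)  # assumption: a constant feature maps to the lower bound
            else:
                row.append((p[j] - mins[j]) / value_range * span + low)
        normalized.append(row)
    return normalized


# Two features on very different scales end up on the same [1, 11] range.
raw = [[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]]
print(normalize(raw))
```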
The first data set consists of textured multispectral images taken at 16 spectral channels (from 500 nm to 650 nm) [2,24]. 592 different samples (multispectral images) of size 128 × 128 have been used to carry out the analysis. The samples are prostatic sections routinely viewed at low power (×40 objective magnification) by two highly experienced independent pathologists and labelled into four classes: 165 cases of stroma, 106 cases of BPH, 144 cases of PIN, and 177 cases of PCa. The size of the feature vector is 128 (16 bands × 8 features (2 structural + 6 Haralick)).
The second data set is derived from prostatic nuclei extracted from prostate tissue [3]. Nuclei are imaged under high power (×100 objective magnification). These prostatic nuclei are taken at 33 spectral channels (from 400 nm to 720 nm). 230 different images of size 256 × 256 have been used to carry out the analysis. The samples are labelled into 3 classes: 63 cases of BPH, 79 cases of PIN, and 88 cases of PCa. The size of the feature vector is 266 (33 bands × 8 features (3 statistical + 5 Haralick) + 2 morphology features).
Table 1 shows the classification error for the first data set, which is reported in [2] with data reduction by using principal component analysis and classification using the classical linear discrimination method. Table 2 shows the minimum classification error obtained by using fuzzy tabu search and classification results using the leave-one-out 1-NN classifier. The classification accuracy has been increased for all cases. The overall classification error has been reduced to 2.90% as compared to 5.57% reported in [2]. Furthermore, the number of features used is only 16 out of the available 128, as compared to 20 obtained using principal component analysis in [2].
Figure 4 shows feature vector size versus classification error rate. Our proposed algorithm is able to obtain superior classification accuracy for various feature vector sizes when compared with the technique used in [2]. Even the classification accuracy for a feature vector size of 10 is better than that reported.
Table 3 shows the classification error for the second data set, which is reported in [3] with data reduction by using principal component analysis and classification using the classical quadratic discrimination method. Table 4 shows the minimum classification error obtained by using fuzzy tabu search and classification results using the leave-one-out 1-NN classifier. The classification accuracy has been increased for all cases. The overall classification error has been reduced to 0.91% as compared to 5.1% reported in [3]. The number of features used is only 13 out of the available 266, as compared to 25 obtained using principal component analysis in [3].
Figure 5 shows feature vector size versus classification error rate. Our proposed algorithm is able to obtain superior classification accuracy for various feature vector sizes when compared with the technique used in [3]. Even the classification accuracy for a feature vector size of 10 is better than that reported.

Quality of solutions by tabu search
Since the features are randomly added or deleted during tabu search, it is imperative that the tabu search algorithm has the ability to find optimal or near-optimal solutions during different runs. Table 5 shows the solutions obtained by tabu search for different runs. If tabu search is executed with the same input and settings, it is highly likely that an acceptably accurate solution will be reached, as shown in Table 5.
Figure 6 shows the fuzzy membership function, the number of incorrect predictions, the number of features, and the classification error versus the number of iterations during the exploration of the solution space using tabu search. The figure clearly shows how well tabu search focuses on the good solution space. From the curve, it can be clearly seen that tabu search rapidly converged to the feasible/infeasible region border in all these objectives. Figure 7 depicts the different objective functions versus the number of iterations after reducing the size of the feature set by using the intensification technique mentioned in Section 2. From these curves, it can be seen that the search for the best solutions is now limited to only the good solution space, that is, the fuzzy membership is in the range between 0.15 and 0.25 most of the time, while the same fuzzy membership was in the range between 0.2 and 0.4 before intensification.

Runtime parameters for tabu search
Table 6 shows the tabu runtime parameters chosen after experimentation with different values. The values of M and N as mentioned in Section 3.1 are 50 and 10, respectively. Table 7 shows the results of the test for different values of M and N. A large value of M and a small value of N cause tabu search to converge slowly and to search for near-optimal solutions in almost the same search space as without intensification. On the contrary, a small value of M and a large value of N make the search less intensive. In practice, medium values of M and N are more appropriate for intensification.

A discussion on computation time
The execution time to reduce the dimensionality of the feature vector using tabu search and the 1-NN classifier is much higher than that of reducing the dimensionality using PCA and LDA (10-12 hours as compared to a few seconds). This can be explained by the fact that the former belongs to the class of iterative heuristics where, for each iteration, different solutions are explored, while the latter is based on only a single solution. However, feature selection using TS/1-NN is an offline procedure which is used to find the best subset of features while keeping the classification error rate low. Once the feature selection procedure finds the best subset of features, the 1-NN classifier is used to determine the class of a new sample (a multispectral image), which provides an online diagnosis decision for the pathologist. Thus, the main advantage is that feature selection leads to savings in measuring features, as most of the features are discarded. The selected features retain their original physical meanings, while for feature extraction the transformed features have no clear physical meaning [4].
Table 8 shows the computation time comparison when determining the class of a new sample. The execution time using feature extraction is much higher for both data sets when compared with feature selection. As is clear from the table, by using feature selection, the cost of classification is reduced by limiting the number of features which must be measured, whereas feature extraction requires all of the original features to be measured.

CONCLUSION
In this paper, a tabu search method with intermediate-term memory is proposed for the feature selection problem with a large feature vector size. The cost of a feature subset is measured by the leave-one-out correct-classification rate of a 1-NN classifier. Feature selection using tabu search has proved to be effective for the classification of prostate needle biopsies.
Results have indicated a significant boost in the performance both in terms of minimizing the features and maximizing the classification accuracy. This method is quite generic and can be used with different classifiers on problems of considerable dimensionality. Furthermore, the proposed tabu search progressively zoomed towards a better solution subspace as time elapsed, a desirable characteristic of approximation iterative heuristics. Future research is to exploit the parallelism inherent in tabu search to speed up the search.

Figure 2: Membership function for fuzzy subset X, where X is the number of features F, the number of incorrect predictions P, or the classification error rate E in this application.

Figure 3: An example showing intensification steps for tabu search, with the number of occurrences of each feature in the best solutions.

Figure 4: Features versus classification error for data set 1.

Figure 5: Features versus classification error for data set 2.

Figure 6: Objective functions versus number of iterations for data set 1 before intensification.

Figure 7: Objective functions versus number of iterations for data set 1 after intensification.

Table 2: Classification error by using feature selection through tabu search.

Table 4: Classification error by using feature selection through tabu search.

Table 5: Classification error (%) for different runs using tabu search.

Table 6: Tabu runtime parameters. P = parameters, $V^*$ = number of neighborhood solutions, T = tabu list size, and I = number of iterations.

Table 7: Comparison of different values of M and N. E = classification error rate and F = number of selected features.

Table 8: Computation time comparison. M = measuring features cost, DR = data reduction, C = classification, and T = total time.