Rough sets
An information system can be expressed as a quadruple[10] S = {U, R, V, f}. U is a finite and non-empty set of objects called the universe, and R = C ∪ D is a finite set of attributes, where C denotes the condition attributes and D denotes the decision attributes. V = ∪_{r∈R} v_r is the domain of the attributes, where v_r denotes the set of values that the attribute r may take. f: U × R → V is an information function. The equivalence relation R partitions the universe U into subsets; such a partition of the universe is denoted by U/R = {E_1, E_2, …, E_n}, where E_i is an equivalence class of R. If two elements u, v ∈ U belong to the same equivalence class E ∈ U/R, u and v are indiscernible; the corresponding indiscernibility relation is denoted by ind(R). If ind(R) = ind(R − {r}), the attribute r is unnecessary in R; otherwise, r is necessary in R.
Since it is not possible to differentiate the elements within the same equivalence class, one may not obtain a precise representation for an arbitrary set X ⊆ U. A set X that can be expressed as a union of some R-basic categories is called a definable set; the remaining sets are rough sets. A rough set can be described by its upper and lower approximations: the elements in the lower approximation of X definitely belong to X, and the elements in the upper approximation of X possibly belong to X. The lower and upper approximations of X with respect to R are defined as follows[11]:

\underline{R}X = \{ x \in U \mid [x]_R \subseteq X \}   (1)

\overline{R}X = \{ x \in U \mid [x]_R \cap X \neq \emptyset \}   (2)

where \underline{R}X represents the set of elements that definitely belong to X, and \overline{R}X represents the set of elements that possibly belong to X.
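As a concrete illustration of equations (1) and (2), the following Python sketch computes the lower and upper approximations of a target set X from the equivalence classes induced by a set of attributes. The information table, the attribute indices, and the function names are hypothetical examples, not taken from the article.

```python
from itertools import groupby

def equivalence_classes(universe, attrs, value):
    """Partition the universe into the equivalence classes of ind(R): objects
    with identical values on every attribute in attrs fall into the same class."""
    key = lambda obj: tuple(value(obj, a) for a in attrs)
    return [set(group) for _, group in groupby(sorted(universe, key=key), key=key)]

def approximations(universe, attrs, value, X):
    """Lower and upper approximations of X, equations (1) and (2)."""
    lower, upper = set(), set()
    for E in equivalence_classes(universe, attrs, value):
        if E <= X:          # class contained in X -> definitely in X
            lower |= E
        if E & X:           # class intersects X   -> possibly in X
            upper |= E
    return lower, upper

# Hypothetical information table: each object is described by two condition attributes.
table = {1: ("high", "yes"), 2: ("high", "yes"), 3: ("low", "no"), 4: ("low", "yes")}
value = lambda obj, a: table[obj][a]
X = {1, 3, 4}                                        # target concept
low, up = approximations(set(table), (0, 1), value, X)
print(low, up)   # lower = {3, 4}, upper = {1, 2, 3, 4}: X is rough for these attributes
```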
Suppose P and Q are both equivalence relations on the universe U, and the knowledge determined by them is given by the partitions U/P = {[x]_P | x ∈ U} and U/Q = {[y]_Q | y ∈ U}. If every [x]_P ∈ U/P can be expressed as a union of equivalence classes of U/Q, then knowledge P depends on knowledge Q completely; that is to say, once the object under study is known to have some characteristic of Q, its characteristic with respect to P is determined, so P and Q are in a deterministic relationship. If knowledge P depends on knowledge Q only partly, P and Q are in an uncertain relationship. The degree to which knowledge P depends on knowledge Q is defined as[10]

\gamma_Q(P) = \frac{\operatorname{card}(POS_Q(P))}{\operatorname{card}(U)}   (3)
where POS_Q(P) = \bigcup_{X \in U/P} \underline{Q}X is the Q-positive region of the partition U/P, and 0 ≤ γ_Q ≤ 1. The value of γ_Q reflects the degree to which knowledge P depends on knowledge Q: γ_Q = 1 shows that knowledge P depends on knowledge Q completely; γ_Q close to 1 shows that knowledge P depends on knowledge Q highly; γ_Q = 0 shows that knowledge P is independent of knowledge Q.
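The dependency degree of equation (3) can be computed by taking the Q-lower approximation of every P-class and measuring the size of the resulting positive region. The sketch below reuses the equivalence_classes and approximations helpers from the previous sketch; it illustrates the definition rather than the article's implementation.

```python
def dependency_degree(universe, P_attrs, Q_attrs, value):
    """Degree gamma_Q to which knowledge P depends on knowledge Q, equation (3):
    the fraction of objects whose Q-class is wholly contained in some P-class."""
    positive = set()
    for X in equivalence_classes(universe, P_attrs, value):
        lower, _ = approximations(universe, Q_attrs, value, X)
        positive |= lower                     # Q-positive region of the partition U/P
    return len(positive) / len(universe)      # 0 <= gamma_Q <= 1
```

In a decision table, one would typically take P as the partition induced by the decision attributes D and Q as the partition induced by a subset of condition attributes, so that γ_Q measures how well those conditions determine the decision.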
Rough k-means algorithm
The k-means algorithm is one of the most popular iterative descent clustering algorithms[12]. The basic idea is to make the samples highly similar within a cluster and dissimilar between clusters. The center of a cluster can be given by:

t_i = \frac{1}{\operatorname{card}(X_i)} \sum_{x \in X_i} x, \quad 1 \le i \le I   (4)

where x denotes a sample to be clustered, X_i denotes cluster i, card(X_i) denotes the number of elements in X_i, t_i denotes the center of cluster i, and I denotes the number of clusters.
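To make equation (4) concrete, the following sketch performs one iteration of standard k-means with NumPy: every sample is assigned to its nearest center, and each center is then recomputed as the mean of its members. The function name and array shapes are our own assumptions.

```python
import numpy as np

def kmeans_step(samples, centers):
    """One standard k-means iteration: assign every sample to its nearest center,
    then recompute each center t_i as the mean of its members (equation (4))."""
    # distances[n, i] = ||x_n - t_i||
    distances = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centers = centers.copy()
    for i in range(len(centers)):
        members = samples[labels == i]
        if len(members):                      # card(X_i) > 0
            new_centers[i] = members.mean(axis=0)
    return new_centers, labels
```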
The k-means algorithm is efficient for clustering, but it has the following problems:

1. The number of clusters must be given before clustering [13].
2. The algorithm is very sensitive to the selection of the initial centers and can easily end up in a local minimum [13, 14].
3. The algorithm is also sensitive to isolated points [15].
To overcome the problem of isolated points, Lingras and West[15] proposed the rough k-means algorithm. This method introduces the upper and lower approximations into the k-means clustering algorithm. The improved cluster center is given by[15]:

t_i = \omega_{lower} \frac{\sum_{x \in \underline{X_i}} x}{\operatorname{card}(\underline{X_i})} + \omega_{upper} \frac{\sum_{x \in \overline{X_i} - \underline{X_i}} x}{\operatorname{card}(\overline{X_i} - \underline{X_i})}   (5)

where \underline{X_i} and \overline{X_i} denote the lower and upper approximations of cluster i, and the parameters ω_lower and ω_upper are the membership weights of the lower and upper approximations relative to the cluster centers. For each sample x, d(x, t_i) denotes the distance between the sample and the center t_i of cluster i. Whether x is assigned to the lower or the upper approximation of a cluster is based on the value of d(x, t_i) − d_min(x) (1 ≤ i ≤ I), where d_min(x) = min_{1 ≤ i ≤ I} d(x, t_i). If d(x, t_i) − d_min(x) ≥ λ for every cluster i other than the nearest one, the sample x is assigned to the lower approximation of its nearest cluster, where λ denotes the threshold for determining the upper and lower approximations. Otherwise, x is assigned to the upper approximations of the nearest cluster and of every cluster for which the difference is smaller than λ. The weights ω_lower and ω_upper can be determined by the number of elements in the lower approximation set and the upper approximation set, as follows:
(6)
(7)
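The following Python sketch performs one iteration of a rough k-means update in the spirit of equation (5). It is an illustration under stated assumptions: the threshold λ and the fixed values of ω_lower and ω_upper are arbitrary example settings rather than the cardinality-based weights of (6) and (7), and NumPy is assumed.

```python
import numpy as np

def rough_kmeans_step(samples, centers, lam=0.5, w_lower=0.7, w_upper=0.3):
    """One rough k-means iteration (a sketch of the update in equation (5)).
    lam plays the role of lambda; w_lower and w_upper are illustrative fixed
    weights, not the cardinality-based weights of equations (6) and (7)."""
    k = len(centers)
    lower = [[] for _ in range(k)]     # samples in the lower approximation of each cluster
    boundary = [[] for _ in range(k)]  # samples assigned only to upper approximations
    for x in samples:
        d = np.linalg.norm(centers - x, axis=1)
        nearest = int(d.argmin())
        # other clusters whose distance is within lam of the minimum distance
        close = [i for i in range(k) if i != nearest and d[i] - d[nearest] < lam]
        if close:                      # ambiguous sample: upper approximations only
            for i in [nearest, *close]:
                boundary[i].append(x)
        else:                          # unambiguous: lower approximation of the nearest cluster
            lower[nearest].append(x)
    new_centers = centers.copy()
    for i in range(k):
        low, bnd = np.array(lower[i]), np.array(boundary[i])
        if len(low) and len(bnd):      # weighted combination, as in equation (5)
            new_centers[i] = w_lower * low.mean(axis=0) + w_upper * bnd.mean(axis=0)
        elif len(low):
            new_centers[i] = low.mean(axis=0)
        elif len(bnd):
            new_centers[i] = bnd.mean(axis=0)
    return new_centers, lower, boundary
```

Samples that are nearly equidistant from two or more centers end up only in upper approximations, so such ambiguous samples influence the new centers only through the smaller weight ω_upper, which is what makes the method less sensitive to isolated points.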
SVM
In this section, we give a very brief introduction to SVM. Let (x_i, y_i), 1 ≤ i ≤ N, be a set of training examples, where each example x_i ∈ R^d, d being the dimension of the input space, belongs to a class labeled by y_i ∈ {−1, 1}. The aim of SVM is to find the hyperplane that places the samples with the same label on the same side of the hyperplane, which amounts to finding w and b that satisfy

y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, N.   (8)

The quantity 2/\|w\| is called the margin, and the optimal separating hyperplane (OSH) is the separating hyperplane that maximizes the margin. The larger the margin, the better the generalization is expected to be[16].
Maximizing the margin is equivalent to minimizing \|w\|^2/2 subject to (8). To find this minimum, Lagrange multipliers are usually used, leading to the dual problem of maximizing

W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)   (9)

subject to

\sum_{i=1}^{N} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \; i = 1, \ldots, N,   (10)

where α = (α_1, …, α_N) denotes the non-negative Lagrange multipliers, x_i denotes the input of the training data, and y_i denotes the output of the training data[17].
The decision function is

f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) + b \right).   (11)
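Given a dual solution, the decision function (11) is the sign of a weighted sum of inner products with the training inputs. The short sketch below uses hypothetical values for α, y, the support vectors, and b.

```python
import numpy as np

def svm_decision(x, alphas, labels, support_vectors, b):
    """Linear SVM decision function, equation (11):
    f(x) = sign( sum_i alpha_i * y_i * (x_i . x) + b )."""
    return np.sign(np.sum(alphas * labels * (support_vectors @ x)) + b)

# Hypothetical dual solution with two support vectors in R^2.
alphas = np.array([0.5, 0.5])
labels = np.array([1.0, -1.0])
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(svm_decision(np.array([0.8, 0.3]), alphas, labels, support_vectors, b=0.0))  # 1.0
```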
For noisy, non-separable data, the approach is to allow a soft margin. We introduce the slack variables ξ_1, …, ξ_N with ξ_i ≥ 0 so that

y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N.   (12)
The generalized OSH is then the solution of minimizing

\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i   (13)

subject to (12) and ξ_i ≥ 0. The term \sum_i \xi_i is an upper bound on the number of training errors, and C is the penalty parameter that controls the trade-off between the margin and the training errors.
In the nonlinear SVM, a kernel function is introduced to map the initial data into a high-dimensional feature space in which the data are expected to be linearly separable. The quadratic optimization problem is then converted to maximizing

W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)   (14)

subject to (10) and 0 ≤ α_i ≤ C, where K(x, x_i) is the kernel function. As one of the most popular kernel functions, the RBF kernel is considered in this article; it takes the following form[18, 19]:

K(x, x_i) = \exp\left( -\gamma \|x - x_i\|^2 \right).   (15)
With the RBF kernel, (14) becomes the problem of maximizing

W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \exp\left( -\gamma \|x_i - x_j\|^2 \right)   (16)

subject to (10) and 0 ≤ α_i ≤ C. The new decision function is

f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \exp\left( -\gamma \|x_i - x\|^2 \right) + b \right).   (17)
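In code, the RBF kernel of equation (15) simply replaces the inner product, and the decision function (17) becomes a kernel-weighted sum over the support vectors. The following sketch is illustrative only; in practice the α_i, y_i, and b would come from solving (16).

```python
import numpy as np

def rbf_kernel(x, xi, gamma):
    """RBF kernel, equation (15): K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def rbf_decision(x, alphas, labels, support_vectors, b, gamma):
    """Decision function, equation (17): sign of the kernel-weighted sum plus bias."""
    k = np.array([rbf_kernel(x, sv, gamma) for sv in support_vectors])
    return np.sign(np.sum(alphas * labels * k) + b)
```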
The performance of the resulting classifier depends on the choice of the parameters C and γ. Usually, C and γ are determined by cross-validation.
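The article does not say how the cross-validation is organized; a common choice, sketched below under that assumption, is a grid search over C and γ using scikit-learn's SVC with an RBF kernel and GridSearchCV on a hypothetical training set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical training data: 100 samples, 5 features, labels in {-1, 1}.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, 1, -1)

# 5-fold cross-validated grid search over the penalty C and the RBF width gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # the selected C and gamma
```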