Rough sets
An information system can be expressed as a four-tuple [10]: S = (U, R, V, f). U is a finite, nonempty set of objects called the universe, and R = C ∪ D is a finite set of attributes, where C denotes the condition attributes and D denotes the decision attributes. V = ∪ v_{r} (r ∈ R) is the domain of the attributes, where v_{r} denotes the set of values that attribute r may take. f: U × R → V is an information function. The equivalence relation R partitions the universe U into subsets; such a partition is denoted by U/R = {E_{1}, E_{2}, …, E_{n}}, where E_{i} is an equivalence class of R. If two elements u, v ∈ U belong to the same equivalence class E ∈ U/R, u and v are indistinguishable, denoted by ind(R). If ind(R) = ind(R − {r}), r is unnecessary in R; otherwise, r is necessary in R.
Since elements within the same equivalence class cannot be differentiated, a set X ⊆ U may not admit a precise representation. A set X that can be expressed as a union of some basic categories of R is called definable; all other sets are rough sets. A rough set is characterized by its upper and lower approximations: elements in the lower approximation of X definitely belong to X, while elements in the upper approximation of X possibly belong to X. The lower and upper approximations of X with respect to R are defined as follows [11]:
\underset{\_}{R}\left(X\right)=\cup \left\{Y\in U/R:Y\subseteq X\right\}
(1)
\overline{R}\left(X\right)=\cup \left\{Y\in U/R:Y\cap X\ne \varnothing \right\}
(2)
where \underset{\_}{R}\left(X\right) represents the set of elements that certainly belong to X, and \overline{R}\left(X\right) represents the set of elements that possibly belong to X.
Suppose P and Q are both equivalence relations on the universe U, and the knowledge systems determined by them are U/P = {[x]_{P} | x ∈ U} and U/Q = {[y]_{Q} | y ∈ U}. If for any [x]_{P} ∈ U/P, \overline{Q}\left({\left[x\right]}_{P}\right)=\underset{\_}{Q}\left({\left[x\right]}_{P}\right)={\left[x\right]}_{P}, then knowledge P depends completely on knowledge Q; that is, if the object under study has a characteristic of Q, it must have a characteristic of P, and P and Q are in a definite relationship. If knowledge P depends on knowledge Q only partly, P and Q are in an uncertain relationship. The degree to which knowledge P depends on knowledge Q is defined as [10]
{\gamma}_{Q}=\left|{\text{POS}}_{Q}\left(P\right)\right|/\left|U\right|
(3)
where {\text{POS}}_{Q}\left(P\right)={\cup}_{x\in U/P}\underset{\_}{Q}\left(x\right) is the positive region of P with respect to Q, and 0 ≤ γ_{Q} ≤ 1. The value of γ_{Q} reflects the degree to which knowledge P depends on knowledge Q: γ_{Q} = 1 indicates that P depends completely on Q; γ_{Q} close to 1 indicates that P depends highly on Q; and γ_{Q} = 0 indicates that P is independent of Q.
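The approximations (1)-(2) and the dependency degree (3) can be illustrated with a small sketch in Python; the toy decision table, its attribute values, and the helper names are invented for illustration only:

```python
from collections import defaultdict

def partition(universe, key):
    # U/R: group objects into equivalence classes by attribute value
    classes = defaultdict(set)
    for u in universe:
        classes[key(u)].add(u)
    return list(classes.values())

def lower_approx(blocks, target):
    # Eq. (1): union of equivalence classes entirely contained in the target set
    return {u for b in blocks if b <= target for u in b}

def upper_approx(blocks, target):
    # Eq. (2): union of equivalence classes that intersect the target set
    return {u for b in blocks if b & target for u in b}

def dependency(universe, q_blocks, p_blocks):
    # Eq. (3): gamma_Q = |POS_Q(P)| / |U|
    pos = set()
    for x in p_blocks:
        pos |= lower_approx(q_blocks, x)
    return len(pos) / len(universe)

# Toy decision table: object -> (condition attribute c, decision attribute d)
data = {0: (0, 'a'), 1: (0, 'a'), 2: (1, 'b'),
        3: (1, 'b'), 4: (2, 'a'), 5: (2, 'b')}
U = set(data)
q_blocks = partition(U, lambda u: data[u][0])  # U/Q from the condition attribute
p_blocks = partition(U, lambda u: data[u][1])  # U/P from the decision attribute

X = {u for u in U if data[u][1] == 'a'}        # X = {0, 1, 4}
print(lower_approx(q_blocks, X))               # {0, 1}: certainly in X
print(upper_approx(q_blocks, X))               # {0, 1, 4, 5}: possibly in X
print(dependency(U, q_blocks, p_blocks))       # 4/6: P depends partly on Q
```

Here X is rough with respect to Q: its lower and upper approximations differ, and γ_{Q} = 4/6 < 1 shows a partial dependency.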
Rough k-means algorithm
The k-means algorithm is one of the most popular iterative descent clustering algorithms [12]. The basic idea is to obtain high similarity within each cluster and low similarity between clusters. The center of a cluster is given by:
{t}_{i}=\frac{\sum _{x\in {X}_{i}}x}{\text{card}\left({X}_{i}\right)},\quad i=1,2,\dots ,I
(4)
where x denotes a sample to be clustered, X_{i} denotes cluster i, card(X_{i}) denotes the number of elements in X_{i}, and I denotes the number of clusters.
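As a sketch of the plain algorithm, the following minimal Python implementation alternates nearest-center assignment with the centroid update of Eq. (4). The sample points are invented, and seeding the centers with the first k samples is an illustrative assumption:

```python
import math

def kmeans(samples, k, iters=100):
    # Plain k-means; the first k samples seed the centers, which
    # illustrates the sensitivity to initialization noted below.
    centers = [tuple(s) for s in samples[:k]]
    for _ in range(iters):
        # assign every sample to its nearest center
        clusters = [[] for _ in range(k)]
        for x in samples:
            i = min(range(k), key=lambda j: math.dist(x, centers[j]))
            clusters[i].append(x)
        # Eq. (4): t_i = (sum of the members of X_i) / card(X_i)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # assignments stable: converged
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(cl) for cl in clusters))  # [3, 3]
```

Even on this easy data the first iteration produces an unbalanced split before the centers drift apart, a small illustration of the initialization sensitivity.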
The k-means algorithm is efficient for clustering, but it has the following problems:

1.
The number of clusters in the algorithm must be given before clustering [13].

2.
The k-means algorithm is very sensitive to the selection of the initial centers and can easily end up in a local minimum [13, 14].

3.
The k-means algorithm is also sensitive to isolated points [15].
To overcome the problem of isolated points, Lingras and West [15] proposed the rough k-means algorithm, which introduces the upper and lower approximations into the k-means clustering algorithm. The improved cluster center is given by [15]:
{C}_{j}=\left\{\begin{array}{ll}{\omega}_{\text{lower}}\times \frac{{\sum}_{v\in \underset{\_}{A}(x)}{v}_{j}}{\left|\underset{\_}{A}(x)\right|}+{\omega}_{\text{upper}}\times \frac{{\sum}_{v\in \overline{A}(x)-\underset{\_}{A}(x)}{v}_{j}}{\left|\overline{A}(x)-\underset{\_}{A}(x)\right|}&\text{if }\overline{A}(x)-\underset{\_}{A}(x)\ne \varnothing \\ {\omega}_{\text{lower}}\times \frac{{\sum}_{v\in \underset{\_}{A}(x)}{v}_{j}}{\left|\underset{\_}{A}(x)\right|}&\text{otherwise}\end{array}\right.
(5)
where the parameters ω_{lower} and ω_{upper} are the lower and upper membership weights of the samples relative to their cluster centers. For each object vector x, d(x, t_{i}) denotes the distance between the center t_{i} of cluster i and the sample. Whether x belongs to the lower or the upper approximation of a cluster is based on the value of d(x, t_{i}) − d_{min}(x) (1 ≤ i ≤ I), where d_{min}(x) = min_{i∈[1,I]} d(x, t_{i}). If d(x, t_{i}) − d_{min}(x) ≥ λ for every cluster i other than the nearest one, the sample x is assigned to the lower approximation of its nearest cluster, where λ denotes the threshold separating the upper and lower approximations. Otherwise, x is assigned to the upper approximations of all clusters for which the difference is below λ. The relative weights can be determined by the numbers of elements in the lower and upper approximation sets, as follows:
\frac{{\omega}_{\text{lower}}(i)}{{\omega}_{\text{upper}}(i)}=\frac{\left|\overline{A}({X}_{i})\right|}{\left|\underset{\_}{A}({X}_{i})\right|},\quad \underset{\_}{A}({X}_{i})\ne \varnothing
(6)
{\omega}_{\text{lower}}\left(i\right)+{\omega}_{\text{upper}}\left(i\right)=1.
(7)
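One iteration of the rough assignment and the weighted center update of Eqs. (5)-(7) can be sketched as follows. The sample points, the threshold λ, and the convention that an empty boundary set yields the plain mean of the lower approximation are assumptions of this sketch, not prescriptions of the original algorithm:

```python
import math

def rough_kmeans_step(samples, centers, lam):
    # One rough k-means iteration: rough assignment, then the weighted
    # centroid update of Eq. (5) with weights from Eqs. (6)-(7).
    k, dim = len(centers), len(samples[0])
    lower = [set() for _ in range(k)]
    upper = [set() for _ in range(k)]
    for idx, x in enumerate(samples):
        d = [math.dist(x, c) for c in centers]
        d_min = min(d)
        close = [i for i in range(k) if d[i] - d_min < lam]
        if len(close) == 1:          # clear-cut sample: lower approximation
            lower[close[0]].add(idx)
        for i in close:              # every nearby cluster: upper approximation
            upper[i].add(idx)

    def mean(idxs, j):
        return sum(samples[t][j] for t in idxs) / len(idxs)

    new_centers = []
    for i in range(k):
        low, bound = lower[i], upper[i] - lower[i]
        if low and bound:
            # Eqs. (6)-(7): w_lower / w_upper = |upper| / |lower|, summing to 1
            w_low = len(upper[i]) / (len(upper[i]) + len(low))
            w_up = 1.0 - w_low
            new_centers.append(tuple(w_low * mean(low, j) + w_up * mean(bound, j)
                                     for j in range(dim)))
        elif low:                    # empty boundary: plain mean of the lower set
            new_centers.append(tuple(mean(low, j) for j in range(dim)))
        else:                        # empty cluster: keep the old center
            new_centers.append(tuple(centers[i]))
    return new_centers, lower, upper

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (4.9, 5.1), (2.5, 2.6)]
centers, lower, upper = rough_kmeans_step(pts, [(0.0, 0.0), (5.0, 5.0)], lam=1.0)
print(lower)  # [{0, 1}, {2, 3}] -- point 4 lies between the centers
print(upper)  # [{0, 1, 4}, {2, 3, 4}]
```

The ambiguous point ends up only in the two upper approximations, so it influences each center with the smaller weight ω_{upper} instead of dragging one center toward itself.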
SVM
In this section, we give a very brief introduction to SVM. Let (x_{i}, y_{i})_{1 ≤ i ≤ N} be a set of training examples, where each example x_{i} ∈ R^{d}, d being the dimension of the input space, belongs to a class labeled by y_{i} ∈ {−1, 1}. Training amounts to finding w and b which satisfy
{y}_{i}\left[\left(\mathbf{w}\cdot {\mathbf{x}}_{i}\right)+b\right]\ge 1.
(8)
The aim of SVM is to find the hyperplane that places the samples with the same label on the same side. The quantity \frac{2}{\left\|\mathbf{w}\right\|} is called the margin, and the optimal separating hyperplane (OSH) is the separating hyperplane that maximizes the margin. The larger the margin, the better the generalization is expected to be [16].
To minimize \frac{1}{2}{\left\|\mathbf{w}\right\|}^{2} (equivalently, to maximize the margin), Lagrange multipliers are usually used, leading to maximizing
\text{w}\left(\alpha \right)=\sum _{i=1}^{N}{\alpha}_{i}-\frac{1}{2}\sum _{i,j=1}^{N}{\alpha}_{i}{\alpha}_{j}{y}_{i}{y}_{j}\left({\mathbf{x}}_{i}\cdot {\mathbf{x}}_{j}\right)
(9)
subject to
\sum _{i=1}^{N}{\alpha}_{i}{y}_{i}=0,\quad {\alpha}_{i}\ge 0
(10)
where α = (α_{1}, …, α_{N}) denotes the nonnegative Lagrange multipliers, x_{i} denotes the input of the training data, and y_{i} denotes the output of the training data [17].
The decision function is
f\left(\mathbf{x}\right)=\text{sign}\left[\sum _{i=1}^{N}{y}_{i}{\alpha}_{i}\left({\mathbf{x}}_{i}\cdot \mathbf{x}\right)+b\right]
(11)
In the nonlinear case, the approach adapted to noisy data is to allow a soft margin. We introduce the slack variables (ξ_{1}, …, ξ_{N}) with ξ_{i} ≥ 0 so that
{y}_{i}\left[\left(\mathbf{w}\cdot {\mathbf{x}}_{i}\right)+b\right]\ge 1-{\xi}_{i},\quad i=1,\dots ,N.
(12)
The generalized OSH is the solution of minimizing
\frac{1}{2}\mathbf{w}\cdot \mathbf{w}+C\sum _{i=1}^{N}{\xi}_{i}
(13)
subject to (12) and ξ_{i} ≥ 0. The sum ∑ξ_{i} is an upper bound on the number of training errors, and C is the penalty parameter that controls the trade-off between the margin and the training errors.
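Constraint (12) and objective (13) can be made concrete: for a fixed candidate hyperplane (w, b), the smallest slack satisfying (12) is ξ_i = max(0, 1 − y_i((w · x_i) + b)), and (13) then evaluates to a single number. The hyperplane, C, and data below are made-up values for illustration:

```python
def soft_margin_objective(w, b, C, X, y):
    # Smallest slack satisfying Eq. (12): xi_i = max(0, 1 - y_i((w . x_i) + b)),
    # then Eq. (13): (1/2) w . w + C * sum(xi_i).
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

X = [(2.0, 2.0), (-2.0, -2.0), (0.5, 0.0)]   # made-up 2-D training points
y = [1, -1, 1]
obj, slacks = soft_margin_objective(w=(0.5, 0.5), b=0.0, C=1.0, X=X, y=y)
print(slacks)  # [0.0, 0.0, 0.75]: only the third point violates the margin
print(obj)     # 1.0
```

Increasing C would make the third point's margin violation more expensive relative to the ½ w · w term, pushing the optimal hyperplane toward fitting it.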
In the nonlinear SVM, a kernel function is introduced to map the initial data into a high-dimensional feature space in which the data should be linearly separable. The quadratic optimization problem is then converted to maximizing
\text{w}\left(\alpha \right)=\sum _{i=1}^{N}{\alpha}_{i}-\frac{1}{2}\sum _{i,j=1}^{N}{\alpha}_{i}{\alpha}_{j}{y}_{i}{y}_{j}K\left({\mathbf{x}}_{i},{\mathbf{x}}_{j}\right)
(14)
subject to (10) and 0 ≤ α_{i} ≤ C, where K(x, x_{i}) is the kernel function. As one of the most popular kernel functions, the RBF kernel is considered in this article; it takes the following form [18, 19]:
K\left(\mathbf{x},{\mathbf{x}}_{i}\right)=\exp \left\{-\gamma {\left\|\mathbf{x}-{\mathbf{x}}_{i}\right\|}^{2}\right\}.
(15)
With the RBF kernel, (14) is converted to maximizing
\text{w}\left(\alpha \right)=\sum _{i=1}^{N}{\alpha}_{i}-\frac{1}{2}\sum _{i=1}^{N}\sum _{j=1}^{N}{y}_{i}{y}_{j}{\alpha}_{i}{\alpha}_{j}\exp \left\{-\gamma {\left\|{\mathbf{x}}_{i}-{\mathbf{x}}_{j}\right\|}^{2}\right\}
(16)
subject to (10) and 0 ≤ α_{i} ≤ C. The new decision function, summed over the N_{s} support vectors, is:
f\left(\mathbf{x}\right)=\text{sign}\left(\sum _{i=1}^{{N}_{s}}{y}_{i}{\alpha}_{i}K\left(\mathbf{x},{\mathbf{x}}_{i}\right)+b\right)
(17)
The result of the minimization is determined by the selection of the parameters C and γ. Usually, C and γ are determined by cross-validation.
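A sketch of Eqs. (15) and (17): the support vectors, multipliers α_i, and bias b below are made-up values, not the output of solving (16), so the example only demonstrates how the RBF decision function is evaluated:

```python
import math

def rbf(u, v, gamma):
    # Eq. (15): K(u, v) = exp(-gamma * ||u - v||^2)
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def decide(x, support, alphas, labels, b, gamma):
    # Eq. (17): sign of the kernel expansion over the N_s support vectors
    s = sum(yi * ai * rbf(x, sv, gamma)
            for sv, ai, yi in zip(support, alphas, labels))
    return 1 if s + b >= 0 else -1

# Hypothetical "trained" model: these support vectors, multipliers, and
# bias are illustrative values, not the result of an actual optimization.
support = [(0.0, 0.0), (2.0, 2.0)]
alphas = [0.8, 0.8]
labels = [-1, 1]
print(decide((1.8, 1.9), support, alphas, labels, b=0.0, gamma=0.5))   # 1
print(decide((0.1, -0.2), support, alphas, labels, b=0.0, gamma=0.5))  # -1
```

In practice C and γ would be chosen by cross-validation over a grid of candidate values, retraining the model for each (C, γ) pair and keeping the pair with the best validation accuracy.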