Fast Nonnegative Matrix Factorization and Its Application for Protein Fold Recognition

Linear and unsupervised dimensionality reduction via matrix factorization with nonnegativity constraints is studied. Because of these constraints, it stands apart from other linear dimensionality reduction methods. Here we explore nonnegative matrix factorization in combination with three nearest-neighbor classifiers for protein fold recognition. Since matrix factorization is typically done iteratively, convergence can be slow. To speed up convergence, we perform feature scaling (normalization) prior to the beginning of iterations. This results in a significantly (more than 11 times) faster algorithm. A justification of why this happens is provided. Another modification of the standard nonnegative matrix factorization algorithm concerns combining two known techniques for mapping unseen data. This operation is typically necessary before classifying the data in low-dimensional space. Combining the two mapping techniques can yield better accuracy than using either technique alone. The gains, however, depend on the state of the random number generator used for initialization of iterations, on the classifier, and on its parameters. In particular, when employing the best of the three classifiers and reducing the original dimensionality by around 30%, these gains can reach more than 4%, compared to classification in the original, high-dimensional space.


INTRODUCTION
It is not uncommon for certain data sets that their dimensionality n is higher than the number of data points m (here and further it is assumed that the data are accumulated in an n × m matrix accommodating m n-dimensional feature vectors). In such cases, an effect referred to as the curse of dimensionality occurs that negatively influences the clustering and classification of a given data set. Dimensionality reduction is typically used to cure, or at least to mitigate, this effect and can be done by means of feature extraction (FE) or feature selection (FS). FS selects a subset of the original features based on a certain criterion of feature importance or relevance, whereas FE produces a set of transformed (i.e., new) features from the original ones. Features chosen by FS are easy to interpret while those found by FE may not be. In addition, FS often assumes knowledge of class membership information, that is, it is often supervised, in contrast to FE, which is usually unsupervised. Thus, FS looks naturally more attractive than FE from the classification viewpoint, that is, when FS is followed by classification of a data set. However, there can be cases where all or almost all original features turn out to be important (relevant), so that FS becomes inadequate. If this happens, the alternative is FE, and such a case is considered in this paper.
The simplest way to reduce dimensionality is to linearly transform the original data. Given the original, high-dimensional data gathered in an n × m matrix V, a transformed or reduced matrix H, composed of m r-dimensional vectors (r < n and often r ≪ n), is obtained from V according to the linear transformation V ≈ WH (the symbol ≈ indicates that an exact reconstruction of the original data is unlikely in general), where W is an n × r (basis) matrix. It is said that W and H are the factorized matrices and WH is a factorization of V. Principal component analysis (PCA) [1] and independent component analysis (ICA) [2] are well-known techniques performing this operation.
Nonnegative matrix factorization (NMF) also belongs to this class of methods. Unlike the others, it is based on nonnegativity constraints on all matrices involved. Thanks to this fact, it can generate a part-based representation since no subtractions are allowed. Lee and Seung [3] proposed a simple iterative algorithm for NMF and proved its convergence. The factorized matrices are initialized with positive random numbers before starting matrix updates.
It is well known that initialization is of importance for any iterative algorithm: properly initialized, an algorithm converges faster. However, this issue has not yet been investigated in the case of NMF. In order to speed up convergence, we propose to perform feature scaling (normalization) before iterations begin so as to bring the values of all three matrices involved in the factorization within the same range (in our case, between 0 and 1). A justification of why this change leads to faster convergence is provided.
Since dimensionality reduction is typically followed by classification in low-dimensional space, it is important to know when the error rate in this space is lower than that in the original space. Regarding classification, we propose to combine two known techniques for mapping unseen data prior to classification in low-dimensional space. For certain values of the state of the random number generator used to initialize matrices W and H, this combination results in higher accuracy in the low-dimensional space than in the original space. The gains depend not only on the state of the random number generator, but also on a classifier and its parameters.

Original nonnegative matrix factorization
Given the nonnegative matrices V, W, and H whose sizes are n × m, n × r, and r × m, respectively, we aim at a factorization such that V ≈ WH. The value of r is selected according to the rule r < nm/(n + m) in order to obtain data compression (for dimensionality reduction alone it is, however, sufficient that r < n). Each column of W is a basis vector while each column of H is a reduced representation of the corresponding column of V. In other words, W can be seen as a basis that is optimized for linear approximation of the data in V.
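For illustration of the compression rule, with the protein data set used later (n = 125 features and m = 313 training vectors, Section 3.2), r < (125 · 313)/(125 + 313) ≈ 89.3, so the value r = 88 used in the experiments of Section 4 indeed yields compression.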
NMF provides the following simple learning rules, guaranteeing monotonic convergence to a local maximum without the need for setting any adjustable parameters [3]:

W_ia ← W_ia Σ_j (V_ij/(WH)_ij) H_aj,    (1)
W_ia ← W_ia / Σ_k W_ka,    (2)
H_aj ← H_aj Σ_i W_ia V_ij/(WH)_ij.    (3)

The matrices W and H are initialized with positive random values. Equations (1)-(3) iterate until convergence to a local maximum of the following objective function:

F = Σ_ij (V_ij log(WH)_ij − (WH)_ij).    (4)

Maximization of F in (4) corresponds to minimization of a divergence between V and WH: letting F_0 = Σ_ij (V_ij log V_ij − V_ij), minimizing F_0 − F implies maximizing F, subject to the nonnegativity constraints, because F_0 does not include (WH)_ij and therefore has no effect on the optimization and can be omitted.

In its original form, NMF can be slow to converge to a local maximum for large matrices and/or high data dimensionality. On the other hand, stopping after a predefined number of iterations, as is sometimes done, can be premature for obtaining a good approximation. Introducing a parameter tol (0 < tol ≪ 1) to decide when to stop iterations significantly speeds up convergence without negatively affecting the mean-square error (MSE) measuring the approximation quality. That is, iterations stop when F_new − F_old < tol.
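As an illustration, a minimal NumPy sketch of updates (1)-(3) with the tol-based stopping rule is given below. The paper's implementation was in MATLAB; the function and parameter names here are ours, and a small constant eps is added to avoid division by zero, so this is a sketch rather than the authors' code.

```python
import numpy as np

def nmf(V, r, tol=0.01, max_iter=10000, state=1):
    """Lee-Seung multiplicative updates for V (n x m) ~= W (n x r) @ H (r x m),
    stopped when the objective F in (4) grows by less than `tol`."""
    rng = np.random.RandomState(state)        # `state` mimics the RNG state used in the paper
    n, m = V.shape
    W = rng.rand(n, r)                        # positive random initialization
    H = rng.rand(r, m)
    eps = 1e-12                               # numerical guard (our addition)

    def objective():
        WH = W @ H
        return np.sum(V * np.log(WH + eps) - WH)

    F_old = objective()
    for _ in range(max_iter):
        W *= (V / (W @ H + eps)) @ H.T        # update (1)
        W /= W.sum(axis=0, keepdims=True)     # normalization (2)
        H *= W.T @ (V / (W @ H + eps))        # update (3)
        F_new = objective()
        if F_new - F_old < tol:               # stopping criterion
            break
        F_old = F_new
    return W, H
```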
After learning the NMF basis functions, that is, the matrix W, new (previously unseen) data in the matrix V_new are mapped to the r-dimensional space by fixing W and using one of the following techniques: (1) randomly initializing H_new as described above and iterating (3) until convergence [9]; (2) computing a least-squares solution of V_new = W H_new, that is, H_new = (WᵀW)⁻¹Wᵀ V_new [8].
Further on we call the first technique iterative and the second direct, because the latter provides a straightforward noniterative solution. The direct technique can produce negative entries in H_new, thus violating the nonnegativity constraints. There are two possible remedies for this problem: (1) enforcing nonnegativity by setting negative values to zero and (2) using nonnegative least squares. Each solution has its own pros and cons. For instance, setting negative values to zero is computationally much simpler than solving least squares with nonnegativity constraints, but some information is lost after zeroing. On the other hand, the nonnegative least-squares solution has no negative components, but it may not fit as well as the least-squares solution without nonnegativity constraints. Since our goal is to accelerate convergence, we prefer the first (zeroing) solution when employing the direct technique.
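The two mapping techniques could be sketched as follows (again in NumPy, reusing the notation above; function names are ours and the code is illustrative):

```python
import numpy as np

def map_direct(W, V_new):
    """Direct technique: unconstrained least-squares solution of V_new ~= W @ H_new,
    with negative entries of H_new set to zero afterwards (the zeroing remedy)."""
    H_new, *_ = np.linalg.lstsq(W, V_new, rcond=None)
    return np.maximum(H_new, 0.0)

def map_iterative(W, V_new, H_init=None, tol=0.01, max_iter=10000, state=1):
    """Iterative technique: W is kept fixed and only H_new is updated with rule (3)
    until the tol-based stopping criterion is met."""
    rng = np.random.RandomState(state)
    r, m = W.shape[1], V_new.shape[1]
    H = rng.rand(r, m) if H_init is None else H_init.copy()
    eps = 1e-12
    F_old = np.sum(V_new * np.log(W @ H + eps) - W @ H)
    for _ in range(max_iter):
        H *= W.T @ (V_new / (W @ H + eps))    # update (3) only
        F_new = np.sum(V_new * np.log(W @ H + eps) - W @ H)
        if F_new - F_old < tol:
            break
        F_old = F_new
    return H
```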

Modified nonnegative matrix factorization
We propose two modifications of the original iterative NMF algorithm. The first modification concerns feature scaling (normalization) linked to the initialization of the factorized matrices. Typically, these matrices are initialized with positive random numbers, say uniformly distributed between 0 and 1, in order to satisfy the nonnegativity constraints. Hence, the elements of V (the matrix of the original data) also need to be within the same range. Given that V_j is an n-dimensional feature vector, where j = 1, ..., m, its components V_ij are normalized as V_ij/V_kj, where k = arg max_l V_lj. In other words, the components of each feature vector are divided by the maximal value among them. As a result, feature vectors are composed of components whose nonnegative values do not exceed 1. Since all three matrices (V, W, H) now have entries between 0 and 1, it takes much less time to perform the matrix factorization V ≈ WH (the values of the entries in the factorized matrices do not have to grow much in order to satisfy the stopping criterion for the objective function F in (4)) than if V had the original (unnormalized) values. As an additional benefit, the MSE becomes much smaller, too.
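A minimal sketch of this per-column scaling (the safeguard against all-zero columns is our addition):

```python
import numpy as np

def scale_columns(V):
    """Divide each column (feature vector) of the nonnegative matrix V by its
    maximal component, so that all entries fall into [0, 1]."""
    col_max = V.max(axis=0, keepdims=True)
    return V / np.where(col_max > 0, col_max, 1.0)   # guard against all-zero columns
```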
Though this modification is simple, it brings a significant speedup in convergence. The following theorem helps to understand why this happens.

Theorem 1. Assume that F_direct and F_iter are the values of the objective function in (4) when mapping the data with the direct and iterative techniques, respectively. Then F_direct − F_iter ≥ 0 always holds at the start of iterations.
Proof. By definition, F_direct and F_iter are obtained by substituting into (4) the values of H_new produced by the direct and iterative techniques, respectively, at the start of iterations. Writing out the difference F_direct − F_iter and introducing a new variable x, the ratio of V_ij to (WH)_ij, yields a function of x that is plotted in Figure 1. It can be seen that this function is always nonnegative.
The higher x, that is, the bigger the ratio of V_ij to (WH)_ij (the case of unnormalized V), the larger the difference between F_direct and F_iter. In other words, if no normalization of V occurs, the direct mapping technique moves the beginning of iterations far away from the point where the conventional iterative technique starts, since the objective function in (4) is increasing [3]. This, in turn, implies that normalization can significantly speed up convergence. On the other hand, as follows from Figure 1, the only minimum occurs at x = 1, which means V = WH and F_direct = F_iter. In practice, the strict equalities do not hold because of the zeroing of some entries in H. This means that both the direct and iterative techniques start from approximately the same point if V is normalized as described above. To remedy the effect of zeroing, we propose to add a small random number, uniformly distributed between 0 and 1, to each entry of H_new obtained after applying the direct technique. After that, the iterative technique is used for mapping unseen data. In this way, we combine both mapping techniques. This is our second modification, and the resulting technique is called iterative2.
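Combining the pieces, the iterative2 mapping could look like the sketch below (reusing map_direct and map_iterative defined above; again illustrative rather than the authors' code):

```python
import numpy as np

def map_iterative2(W, V_new, tol=0.01, max_iter=10000, state=1):
    """Iterative2 technique: direct (least-squares) mapping with zeroed negatives,
    perturbed by uniform random numbers in (0, 1) to compensate for the zeroing,
    then refined with the iterative technique (W kept fixed)."""
    rng = np.random.RandomState(state)
    H0 = map_direct(W, V_new)                 # direct mapping with zeroing
    H0 += rng.rand(*H0.shape)                 # uniform perturbation, as described above
    return map_iterative(W, V_new, H_init=H0, tol=tol, max_iter=max_iter)
```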

Summary of our algorithm
Suppose that the whole data set is divided into training and test (unseen) sets. Our algorithm is summarized as follows.
(1) Scale both the training and test data and randomly initialize the factorized matrices as described in Section 2.1. Set the parameters tol and r.
(2) Learn the basis matrix W and the reduced training data H by iterating (1)-(3) on the training matrix until the stopping criterion F_new − F_old < tol is met.
(3) Map the test data to the r-dimensional space with the iterative2 technique: apply the direct mapping, add small uniform random numbers to the result, and iterate (3) with W fixed until the stopping criterion is met.
(4) Classify the mapped test data with a nearest-neighbor classifier.
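An illustrative end-to-end run of the sketches above (with random stand-in data of the same shape as the protein features described in Section 3.2):

```python
import numpy as np

rng = np.random.RandomState(1)
V_train = scale_columns(rng.rand(125, 313))   # n x m training matrix, scaled to [0, 1]
V_test = scale_columns(rng.rand(125, 385))    # n x m_test test matrix

W, H_train = nmf(V_train, r=88, tol=0.01, state=1)      # learn basis and reduced training data
H_test = map_iterative2(W, V_test, tol=0.01, state=1)   # map unseen data with iterative2

# H_train (88 x 313) and H_test (88 x 385) are then fed to a nearest-neighbor classifier.
```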

Task
As a challenging task, we selected protein fold recognition from bioinformatics. A protein is an amino acid sequence. In bioinformatics, one of the current trends is to understand evolutionary relationships in terms of protein function. Two common approaches to identifying protein function are sequence analysis and structure analysis. Sequence analysis is based on a comparison between unknown sequences and those whose function is already known. However, some closely related sequences may not share the same function.

Table 1: Best results previously achieved with single classifiers on the data set used in this paper.

Source  Classifier  Error rate (%)
[28]    DIMLP       61.8
[29]    MLP         51.2
[29]    GRNN        55.8
[29]    RBFN        50.6
[29]    SVM         48.6
[30]    RBFN        48.8
[31]    SVM         46.1
[32]    HKNN        42.7
On the other hand, proteins may have low sequence identity while their structure and, in many cases, function suggest a common evolutionary origin. Protein fold recognition is structure analysis that does not rely on sequence similarity. Proteins are said to have a common fold if they have the same major secondary structures in the same arrangement and with the same topology, whether or not they have a common evolutionary origin. The structural similarities of proteins in the same fold arise from the physical and chemical properties favoring certain arrangements and topologies, meaning that various physicochemical features, such as fold compactness or hydrophobicity, can be utilized for recognition. As the gap widens between the number of known sequences and the number of experimentally determined protein structures (the ratio is more than 100 to 1, and sequence databases are doubling in size every year), the demand for automated fold recognition techniques rapidly grows.

Data set
A challenging data set derived from the SCOP (structural classification of proteins) database [33] was used in the experiments described below. It is available online, and its detailed description can be found in [31]. The data set contains the 27 most populated folds, each represented by seven or more proteins. Ding and Dubchak [31] already split it into training and test sets, and we use their split as other authors did.
The data set comprises six feature sets: amino acid composition, predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, and polarizability. A feature vector combining all six sets has 125 dimensions. The training set consists of 313 proteins, each pair of which has no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set of 385 proteins is composed of sequences having less than 40% identity with each other and less than 35% identity with the proteins of the training set. In fact, 90% of the proteins in the test set have less than 25% sequence identity with the proteins of the training set. This, together with the large number of classes, many of which are sparsely represented in the training set, renders the task extremely difficult.

Previous work
All approaches briefly mentioned below use the data set described in the previous section. Unless otherwise stated, a 125-dimensional feature vector is assumed for each protein. In order to provide a fair comparison, we concentrate on single classifiers rather than ensembles of classifiers. All but one of the papers below do not utilize dimensionality reduction prior to classification.
Ding and Dubchak [31] employed support vector machines (SVMs) (one-versus-all, unique one-versus-all, and one-versus-one methods for building multiclass SVMs). Bologna and Appel [28] used a 131-dimensional feature vector (the protein sequence length was added to the other features) and a four-layer discretized interpretable multilayer perceptron (DIMLP). Chung et al. [29] selected different models of neural networks (NNs) with a single hidden layer (MLP, radial basis function network (RBFN), and general regression neural network (GRNN)) and SVMs as basic building blocks for classification. Huang et al. [34] exploited a similar approach by utilizing gated NNs (MLPs and RBFNs). Gating is used for on-line feature selection in order to reduce the number of features fed to a classifier: gates are open for useful features and closed for bad ones. First, the original data are used to train the gating network. At the end of training, the gate-function values for each feature indicate whether a particular feature is relevant or not by comparing these values against a threshold. Only the relevant features are then used to train a classifier. Pal and Chakraborty [30] trained MLPs and RBFNs with 400 new features based on the hydrophobicity of the amino acids. In some cases, the 400-dimensional feature vectors led to higher accuracy than the traditional (125-dimensional) ones. Okun [32] applied a variant of the nearest-neighbor classifier (HKNN). Table 1 summarizes the best results achieved with the above-mentioned methods.
As one can observe, the error rates of single classifiers are high due to the challenges discussed above. Ensembles of classifiers can sometimes reduce the error rate to about 39%, as demonstrated in [28], but their consideration is beyond this work. For HKNN, normalized features were used, since feature normalization to zero mean and unit variance prior to HKNN dramatically increases classification accuracy (on average, by 6% [35]). However, this normalization can produce negative feature values and is thus not appropriate for NMF, which requires the nonnegativity constraints to hold. This is the reason why we prefer another feature scaling, described in Section 2.2.

EXPERIMENTS
Experiments with NMF involve estimation of the error rate when performing classification in low-dimensional space, as well as of the time until convergence, on the data set described in Section 3.2. Three techniques for mapping test data to this space are used: direct, iterative, and iterative2. Regarding NMF, the matrices W and H are initialized with random numbers uniformly distributed between 0 and 1, with the state of the random number generator ranging from 1 to 10. The same value of the state is used for mapping both training and test data. The value of tol was fixed at 0.01.
Though we tried numerous values of r (the dimensionality of the reduced space) between 50 and 100, we report the best results, obtained with r = 88, which constitutes 70.4% of the original dimensionality.
All algorithms were implemented in MATLAB running on a Pentium 4 (3 GHz CPU, 1 GB RAM).

Classifiers
We studied three classifiers: the standard k-nearest neighbor (KNN) [36], kernel k-nearest neighbor (KKNN) [37], and k-local hyperplane distance nearest neighbor (HKNN) [38]. HKNN was selected since it demonstrated a competitive performance compared to SVM when both methods were applied to classify the above-mentioned protein data set in the original space [32, 39], that is, without dimensionality reduction. In addition, when applied to other data sets, the combination of NMF and HKNN showed very good results [40], rendering HKNN a natural choice for NMF. KNN and its kernel variant were selected for comparison with HKNN since all three algorithms belong to the same group of classifiers.
KNN has one parameter to be set: the number of nearest neighbors, k. Typical values for it are 1, 3, and 5.
KKNN is a modification of KNN that applies kernels. With an appropriate kernel, the kernel nearest-neighbor algorithm, via a nonlinear mapping to a high-dimensional feature space, may be superior to KNN for some sample distributions. The same kernels as in the case of SVMs are commonly used, but as remarked in [37], only the polynomial kernel with degree p ≠ 1 is actually useful, since the polynomial kernel with p = 1 and the radial basis (Gaussian) kernel degenerate KKNN to KNN. The kernel approach to KNN consists of two steps: kernel computation, followed by distance computation in the feature space expressed via the kernel. After that, a nearest-neighbor rule is applied just as in the case of KNN. We tested KKNN for all combinations of p = 0.5, 2, 3, 4, 5, 6, 7 and k = 1, 3, 5 (21 combinations of parameters in total).
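The two-step kernel distance computation can be sketched as follows; the polynomial kernel form K(x, y) = (x·y + 1)^p is our assumption for illustration (with p = 1 it reduces to ordinary Euclidean KNN, consistent with the degeneracy noted above):

```python
import numpy as np

def kknn_predict(X_train, y_train, X_test, k=3, p=2):
    """Kernel k-nearest-neighbor sketch: distances are computed in the feature
    space induced by an (assumed) polynomial kernel via
    d^2(x, y) = K(x, x) - 2 K(x, y) + K(y, y), followed by a majority vote
    over the k nearest training points."""
    def poly(A, B):
        return (A @ B.T + 1.0) ** p
    K_tt = np.diag(poly(X_test, X_test))             # K(x, x) for test points
    K_rr = np.diag(poly(X_train, X_train))           # K(y, y) for training points
    K_tr = poly(X_test, X_train)                     # K(x, y) cross terms
    d2 = K_tt[:, None] - 2.0 * K_tr + K_rr[None, :]  # squared feature-space distances
    preds = []
    for row in d2:
        nn = np.argsort(row)[:k]                     # indices of k nearest neighbors
        labels, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])      # majority vote
    return np.array(preds)
```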
HKNN is another modification of KNN intended to compete with SVM when KNN fails to do so. HKNN computes the distances from each test point x to L local hyperplanes, where L is the number of classes. The ℓth hyperplane is composed of the k nearest neighbors of x in the training set belonging to the ℓth class. A test point x is associated with the class whose hyperplane is closest to x. HKNN needs two parameters to be predefined, k and λ (regularization). Their values were 6 and 7 for k and 8, 10, 12, 20, 30, 40, 50 for λ (hence 14 combinations of the two parameters in total). The value k = 7 is the largest possible choice since the minimum number of proteins per class (fold) in the training set is seven.
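A sketch of the HKNN decision rule, following the description above (the exact regularized formulation in [38] may differ in details, so this is illustrative only):

```python
import numpy as np

def hknn_predict(X_train, y_train, X_test, k=7, lam=10.0):
    """K-local hyperplane distance nearest neighbor: for each test point and each
    class, a local hyperplane is spanned by the k nearest training points of that
    class; the point is assigned to the class whose hyperplane is closest."""
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        best_class, best_dist = None, np.inf
        for c in classes:
            Xc = X_train[y_train == c]                  # training points of class c
            d = np.linalg.norm(Xc - x, axis=1)
            N = Xc[np.argsort(d)[:k]]                   # k nearest neighbors of x in class c
            nbar = N.mean(axis=0)
            V = N - nbar                                # directions spanning the local hyperplane
            A = V @ V.T + lam * np.eye(V.shape[0])      # lambda-regularized normal equations
            alpha = np.linalg.solve(A, V @ (x - nbar))
            resid = x - nbar - V.T @ alpha
            dist = np.dot(resid, resid)                 # squared distance to the hyperplane
            if dist < best_dist:
                best_class, best_dist = c, dist
        preds.append(best_class)
    return np.array(preds)
```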

Classification results
Table 2 summarizes the error ranges for the three classifiers when performing classification in the original and NMF spaces for r = 88 and the classifier parameters given in Section 4.1. In the first column, "NMF-Direct" stands for the NMF space with the direct technique used to map test data to this space; "NMF-Iterative" and "NMF-Iterative2" mean the same for the iterative and iterative2 techniques, respectively.

First, we would like to analyze each classifier separately from the others. KNN using normalized features in the original space is clearly the best, since it yields the lowest minimum error as well as the narrowest range of errors. KKNN applied in the NMF-Iterative and NMF-Iterative2 spaces can sometimes lead to errors smaller than those in the original space, but the error ranges achieved in the low-dimensional spaces are significantly wider than the range in the original space. This fact emphasizes the sensitivity of KKNN to its parameter settings when the data dimensionality is reduced by 30%. Finally, HKNN employed in the NMF-Iterative and NMF-Iterative2 spaces demonstrated clear advantages of dimensionality reduction. Though the iterative technique had a slight edge over the iterative2 technique (in only 6 out of 140 experiments) in terms of the minimum error achieved (for both iterative techniques, this error is lower than the error in the original space), the former has a significantly larger maximum error than the latter. In addition, the iterative2 technique yields the same maximum error as that in the original space, and this error is the lowest among all others. Hence, the iterative2 technique causes HKNN to have the smallest variance of error, making it least sensitive to parameter choice. For each classifier, the direct technique for mapping test data to low-dimensional space lagged far behind either iterative technique in terms of classification accuracy. Therefore, we do not recommend applying it alone.
If one compares the three classifiers, HKNN emerges as the undisputed winner in both high- and low-dimensional spaces. Based on previous experience with this classifier [38, 40], this fact is not surprising. Coupled with our modifications of NMF, it demonstrated a good performance, exceeding that of many neural network models and SVMs employed in the high-dimensional space (see Table 1). In particular, compared to classification in the original space, the minimum error in the NMF-Iterative2 space is lower by 4% (last column in Table 2).
We also noticed that for certain values of the state of the random number generator, the iterative2 technique provides better classification accuracy than the iterative technique in more than half of the experimental cases. For example, these states are 1, 5, 7, and 8 for HKNN, with states 1 and 8 shared with the other two classifiers. With such states, the accuracy was better in the overwhelming majority of cases. Thus, it seems that there is a link between high accuracy in the low-dimensional NMF space and the state of the random number generator, which needs further exploration.

Time
In this section, we provide evidence that feature scaling prior to iterations significantly speeds up convergence when mapping both training and test data to low-dimensional space, compared to the case when no scaling is used. Table 3 accumulates the gains resulting from feature scaling for several data dimensionalities. R1 stands for the ratio of the average times spent on learning the NMF basis and mapping training data to the NMF space without and with scaling prior to NMF. R2 is the ratio of the average times spent on mapping test data by means of the iterative technique without and with scaling prior to NMF. R3 is the ratio of the average times spent on mapping test data by means of the iterative2 technique without and with scaling prior to NMF. Thus, the average gain from feature scaling is more than 11 times.

CONCLUSION
The main contribution of this work is two modifications of the basic NMF algorithm [3] and their practical application to the challenging real-world task of protein fold recognition.
The first modification carries out feature scaling before NMF, while the second combines two known techniques for mapping unseen data. When modifying NMF, we considered two aspects: (1) the time until convergence, since factorization is done by means of an iterative algorithm, and (2) the error rate when combining NMF with a classifier. Three nearest-neighbor classifiers were tested.
We demonstrated that proper feature scaling makes the NMF algorithm converge more than 11 times faster. The reason why this happens is explained by Theorem 1 in Section 2.2. Regarding classification in low-dimensional space, our experimental results showed that, along with faster convergence, significant gains in accuracy can also be achieved, compared to the known results in the original, high-dimensional space. However, these gains depend on the state of the random number generator, the classifier, and its parameters.