Cardiology knowledge free ECG feature extraction using generalized tensor rank one discriminant analysis

Applications based on electrocardiogram (ECG) signal feature extraction and classification are of major importance to the autodiagnosis of heart diseases. Most studies on ECG classification methods have targeted only 1- or 2-lead ECG signals. This limitation results from the unavailability of real clinical 12-lead ECG data, which would help train the classification models. In this study, we propose a new tensor-based scheme, which is motivated by the lack of effective feature extraction methods for direct tensor data input. In this scheme, an ECG signal is represented by third-order tensors in the spatial-spectral-temporal domain after using short-time Fourier transform on the raw ECG data. To overcome the limitations of tensor rank one discriminant analysis (TR1DA) inherited from linear discriminant analysis, we introduced a generalized tensor rank one discriminant analysis (GTR1DA). This approach involves considering the distribution of the data points near the classification boundary to calculate better projection tensors. The experimental results showed that the proposed method achieves greater classification accuracy than other vector- and tensor-based methods. Finally, GTR1DA features a better convergence property than the original TR1DA.


Introduction
As the Internet of Things technology matures, remote and centralized electrocardiogram (ECG) diagnostic platforms have been developed. A highly centralized platform has a large amount of ECG data to diagnose daily; therefore, an aided diagnosis system providing reference diagnostics is beneficial. An ECG represents the working state of the heart by recording the magnitude and direction of the electrical activities related to the heart. It provides useful information for doctors to diagnose heart diseases. In recent years, the automatic diagnosis of heart diseases through the analysis of ECG signals has drawn much attention. ECG feature extraction consists of determining a set of discriminant features to represent ECG data to achieve strong classification performance. Many technologies and algorithms on machine learning and signal processing have been used to achieve this goal. Zhao  [1]. Jen et al. described an approach that takes advantage of the neural networks for determining the features of a given ECG signal [2]. Pasolli et al. introduced an active learning method for ECG classification based on certain ECG morphology and temporal features [3]. Zhang et al. used the principal component analysis (PCA) algorithm to extract ECG features [4]. An ECG feature extraction scheme using independent component analysis (ICA) was presented by Wu et al. [5,6].
The main problem with the existing methods is their limitation to 1-or 2-lead ECGs because a public 12-lead ECG database is lacking. Traditional methods developed for 2-lead signals cannot be directly applied to 12-lead ECG signals. They typically use vectors to represent ECG signals; however, this is not a natural representation of ECG signals because a large amount of useful structural information is discarded. The structural information indicates the relative positions of the signals among different leads. The 12-lead ECG provides spatial information about the electrical activity of the heart in three approximately orthogonal directions. If all the information of the http://asp.eurasipjournals.com/content/2014/1/2 12-lead signals is considered, then robust features can be obtained and then used for an automatic analysis. Thus, a high classification accuracy and an efficient representation of the ECG signals can be achieved. Moreover, the time-frequency domain features can also help improve the effect of the classification. Therefore, our method involves converting an original 12-lead ECG signal into a spatial-spectral-temporal domain such that it becomes a third-order tensor.
Scanning all tensor ECG signals into a vector greatly increases the dimensions of the feature space. Consequently, the number of training samples seems extremely small compared with the large dimensions of the feature space; this problem is called the small sample size (SSS) problem [7]. Another problem is related to the number of unknown parameters that can lead to an overfitting problem according to the principle of Ockham's razor. Therefore, the objective is to use the tensor feature extraction method to retain as much of the data structure information as possible. A more natural method to represent ECG compared with vectorization is to use third-order tensors. If the appropriate technology is adopted, then additional information can be extracted from the ECG data, and, therefore, the SSS problem can be addressed. Li et al. successfully used a general tensor discriminant analysis for single-trial electroencephalogram (EEG) classification [8]. This approach has also been adopted in gait recognition [9]. Other tensor methods include multilinear principal component analysis (MPCA) [10,11], uncorrelated multilinear PCA (UMPCA) [12,13], and tensor rank one discriminative analysis (TR1DA) [14]. These approaches can be classified into two categories [15]: the methods corresponding to tensor-to-tensor projection and the methods corresponding to tensor-to-vector projection. GTDA [9] and MPCA [10,11] belong to the tensor-to-tensor category, whereas TR1DA [14] and UMPCA [12,13] belong to the tensor-to-vector projection category. Because TR1DA is an extension of linear discriminant analysis (LDA), it also exhibits the same limitations. More precisely, it does not enable the characteristics of the original data distribution and adjacent points of the different classes to be fully considered. In this study, we propose to improve the original TR1DA method and overcome its limitations. We therefore consider the distribution of the closing points of the different classes to calculate better projection tensors and improve the accuracy of the classification.
In this paper, we introduce a new tensor-based scheme to extract features from 12-lead ECG signals. We represent them using high-order tensors, i.e., 12-lead signals in the spatial-spectral-temporal domain. The tensors were constructed using short-time Fourier transform (STFT) on the raw ECG signals. Generalized TR1DA (GTR1DA) was used to reserve the projection tensors from which the features of the original tensors were obtained. Finally, a support vector machine (SVM) with Gaussian kernel was used for the classification in the feature space. We tested the proposed method on our patient database and achieved a high classification accuracy.
The paper is organized as follows. The data preprocessing approach and the tensor data achievement process are introduced in Sections 2.1 and 2.2, respectively. We then discuss the tensor calculation used in this paper in Section 3. Section 4 indicates the major limitations of LDA, and Section 5 introduces an extension of LDA. In Section 6, the main idea of TR1DA is presented. We then show how to generalize the original TR1DA into GTR1DA in Section 7. Section 8 justifies the proposed method. In Section 9, we introduce the initial value calculation used in our approach. In Section 10, we clarify and briefly explain how SVM was applied for multiclass classification. We then briefly introduce our ECG database in Section 11.1 and describe the convergence improvement and the accuracy of the classification in Section 11.2. Section 12 presents the conclusion.

Tensor-based scheme
Our work covered the data preprocessing, the tensor data computation using STFT, the tensor feature extraction and dimension reduction based on our new GTR1DA approach, and the classification of the multiclass. The block diagram is provided in Figure 1.

Data preprocessing
Raw ECG signals typically display strong background noise. To suppress the noise without ruining the data, we used common methods, such as wavelet transformation, to remove high-frequency noise and median filters Figure 1 The tensor-based scheme for ECG feature extraction. http://asp.eurasipjournals.com/content/2014/1/2 to eliminate the baseline drift. An ECG signal from a diagnosis is approximately 20 s long and comprises approximately 25 beats. Therefore another critical preprocessing step is to segment the signal into 'ECG pieces', each of which contains one heartbeat.

Short-time Fourier transform
The original signals represent the features in the spatialtemporal domain. Because ECG signals are nonstationary, we used STFT [16] instead of the regular Fourier transform to collect information on when a particular frequency component occurs. STFT can provide useful information regarding the time resolution of the spectrum. STFT involves extracting several frames from the signals and analyzing them using a time-sliding window such that the relation between the variation of the frequency and the time can be identified. We used STFT to transform the original signals into a spatial-spectraltemporal domain as high-dimensional third-order tensors [17]. Given a signal that varies over time, STFT was used to determine the sinusoidal frequency and phase the content of the local sections.
For a 12-lead (lead × time) ECG signal sample, s[l, n] represents the discrete-time signal at a time n for a lead l. Then, the STFT at a time n t and a frequency f is given by where w[n] is the window function that selectively determines the portion of s[l, n] to use for the analysis. In this study, we used the Hann window. Once STFT had been applied to the ECG signals, all of the signals were in a third-order tensor form for the remainder of the analysis. In accordance with previous studies on extending an EEG signal to a third-order tensor [18][19][20], we extended the original ECG signal such that we could effectively extract valuable features in the spatial-spectral-temporal domain. To appropriately manage this type of data, using a tensor-based learning approach was necessary.

Tensor operations
We first introduce definitions of tensor operations to fix the notations. In our paper, Mathcal font and uppercase letters denote tensors (e.g., X , Y, Z). Matrices are expressed as bold uppercase letters (e.g., X, B). Lowercase bold letters are used for vectors (e.g., u, a). Lowercase and uppercase letters are scalers (e.g., a, b, c, D, E).

Definition 1.
Tensor product. The tensor product of two vectors x ∈ R M and y ∈ R N is a matrix: which is a rank 1 tensor of mode 2. Here, 0 < i ≤ M and 0 < j ≤ N. The tensor product of three vectors x ∈ R M y ∈ R N , and z ∈ R S is a mode-3 tensor: which is also of rank 1. Here, 0 < i ≤ M, 0 < j ≤ N, and 0 < k ≤ S.
which is a tensor of mode M − 1.

Definition 3.
Multiple tensor product. The tensor product of multiple vectors forms a rank 1 tensor. To simplify its notation, we use the following form to represent the tensor product of several vectors:

Limitations of classical linear discriminant analysis
In this section, we review the main limitations of classical LDA, including the SSS problem, the low-rank problem, the heteroscedastic problem, the problem raised by the unreasonable between-class scatter matrix, and the unconsidered distribution structure between classes.

Small-size problem
In the classical LDA approach, the calculation of the ratio of the between-class covariance distance to the withinclass covariance distance is a generalized Rayleigh quotient problem. We illustrate it using the most widely used optimization problem, which is the first optimization problem in Equation 6.
The Lagrange multiplier method can be used to solve this problem. It is clear from Equation 6 that x has an infinite number of solutions. When x is multiplied by a constant, R (x) maintains the same value (only offsetting the numerator and denominator). Therefore, the length of x is always restricted, such that the denominator is 1. The http://asp.eurasipjournals.com/content/2014/1/2 restriction is used as a condition for the Lagrange method. The solution to this problem maximizes the equation: The above equation is a typical generalized eigenvalue problem. If the within-class scatter matrix S w is invertible, this problem can be converted into an ordinary eigenvalue problem, allowing us to calculate the result: The within-class scatter matrix is the sum of each class covariance matrix, and rank(C) ≤ rank(A) + rank(B); therefore, the rank of the within-class scatter matrix is at most s (the number of classes) less than the sample number.
In most cases, such as image processing, video analysis, FMRI, or CT, the dimensions of the original signal are considerably large. However, if the training set is excessively small and the number of cases is less than the dimension count, the application of LDA is limited, and modifications are required to solve the problem.

Class number 1: low-rank problem
The classical LDA approach involves calculating the between-class scatter matrix by computing the covariance matrix of each class data center. H b can also be expressed in the following form: (1) , C (2) , . . . , C (i) , . . . , C (s) Here, In the first equation, s is the class number. In the second equation, c is the mean of all of the data. The term n i represents the point count for the class i. The term c (i) n i is the n i th mean of class i. The vectors of each class are the same, and, therefore, their rank is 1. Because each column of the matrix is reduced by the average of all columns, the weighted average is 0. In other words, they are linearly correlated: Here, n i is the number of points in class i. Generally, the centers of each class are linearly independent; therefore, the rank of H b is (s − 1). The term s again represents the number of classes. Using the following lemma, it is easy to compute the rank of the between-class scatter matrix as (s − 1).

Lemma 1. For any m × n matrix A rank (A) = rank(A ) = rank(AA ) = rank(A A).
Using Lemma 1, it is easy to conclude that S b and H b share the same rank; specifically, one less than the number of classes s. Although s − 1 is an upper bound, the actual value is close. The rank of S b is less than the upper bound s − 1 only if the centers of the various classes are on the same line, which must pass through the origin. In practice, this case is extremely rare and, therefore, in most cases, the rank of S b is close to s − 1.
This reveals a paradox of classical LDA: if the number of classes is high, then the accuracy of the results is low, but if the number is comparatively low, then the dimension extracted using LDA is also considerably low. In this case, if two classes are used for calculating the projection matrix, and regardless of the height of the original data dimension, the calculated reduced dimension can only be 1. Therefore, in this condition, the classification result is greatly affected.

Heteroscedastic problem
Classical LDA assumes that the data distribution in each class is Gaussian: Here, A i represents the point of class i, and c (i) represents its mean. In this approach, it is assumed that the data distribution in each class approximates a Gaussian distribution and that the covariance matrices of all of the classes are similar. However, in practice, this assumption has not been verified because they might have some type of structure in the feature space or a weak clustering structure. Data for various classes may not be linearly or even nonlinearly separable. Different class data can also overlap.

Unreasonable between-class scatter matrix
In the traditional LDA approach, the design of the between-class scatter matrix is unreasonable as S b . The design of the matrix involves calculating the betweenclass covariance distance of class data centers. This approach seems simple and effective but exhibits several problems. It entails the assumption that the centers of each class, as a whole, approximate a Gaussian distribution. This is often not the case for real data. Conversely, if the number of classes is low, then the data centers can be considerably sparse. In this situation, even if they have a Gaussian distribution, the calculated axis direction can be incorrect because of the sparsity of the data, hardening the http://asp.eurasipjournals.com/content/2014/1/2 calculation. The low-rank problem of dimension (s − 1) is caused by this design.

Extension of classical LDA
We first introduce a method to extend the original LDA approach and then generalize it to the tensor space to form the GTR1DA method. The original LDA method exhibits the small-size problem, the (s−1) low-rank problem, and the heteroscedastic problem. Most importantly, the unreasonable between-class scatter matrix does not consider the distribution of the data points nearing the classification boundary. To overcome these limitations, we introduce the following improvement: where A i and A j are points of classes i and j , respectively. The term c uv represents the mean of u and v. The matrix S uv used here with only two vectors is similar to the case of S b , in which there are only two classes of data. Its rank is 1, and its eigenvector corresponding to the nonzero eigenvalue is the connection of the two data centers. This approach involves considering each pair of points from the two classes. The final projection direction is the sum of the angle between the direction and the connection of each pair of points.
From a geometrical perspective, the projection direction is determined by the angle it makes with the connection of each point pair of the various classes: cos(θ) xuv is the angle between x and the connection of u and v. Evidently, features in the feature space have a distribution that can be highly scattered. This method, which involves calculating pairs of points from pairs of classes, considers the spatial distribution of each class. It is superior to the LDA approach, which only considers the centers of each class. Conversely, a better classification plane is determined based on the closer point pairs. Closing point pair means the point pairs which is closer to the classification boundary of each two classes. Therefore, a natural extension of this method consists of assigning a weight to each point pair, depending on the distance between them: the term d uv is the distance between u and v. The easiest method for determining a weight is to use the reciprocal of a distance 1 d uv , which can be expressed in the following form: The two approaches can be combined as follows: For example, we can assign 1 or d uv −1 to the point pairs as the weight for which the distance is in the range of ). Because data points of different classes might overlap, we added a weight to the interval such as (2% ∼ 10%), depending on the distance between the data pair of different types. Therefore, our approach involves adding nonzero weight to data pairs close to each other but excludes the nearest and farthest pairs. Thus, the distribution of closing point pairs is considered to calculate the projection vector, and the time cost is also effectively reduced. In the experiment, this approach produced a considerably strong performance. Hence, a more reasonable classification plane and projection direction were achieved. Another benefit of our method is that it enables the (s − 1) low-rank problem to be overcome. The rank of S oo is expressed as follows: Here, the rank is an upper bound and is affected only when data points are in one line passing through the origin. For real data, this condition is rare, and therefore, the rank of the matrix is close to the upper bound if the data point count is low. Otherwise, the matrix is full rank. http://asp.eurasipjournals.com/content/2014/1/2 The above-mentioned method can be used to solve the s − 1 low-rank problem, the heteroscedastic problem, the problem of the unreasonable between-class scatter matrix, and the problem of the data distribution between classes. However, the small-size problem remains to be solved. Classical LDA is a multiobjective optimization problem that can be used to simultaneously optimize the following equations: arg max A multiobjective optimization problem can be solved by optimizing several equations simultaneously or by combining multiple optimization targets. Classical LDA combines the maximization and minimization problems by dividing the first by the second, which causes the small-size problem. Therefore, the S w must be full rank. However, changing the division to subtraction yields arg max Thus, the problem can be easily solved using a method similar to classical PCA and maximum scatter difference discriminant analysis [21,22]. We performed an eigenvalue decomposition on the following matrix: We ordered the eigenvectors according to their corresponding eigenvalues. The ordered eigenvectors were the expected projection vectors, which represent the solution for the objective function. Other studies have followed a similar strategy [21,22].

Tensor rank one discriminant analysis
TR1DA is an extension of the original LDA approach from the vector space to the tensor space [14]. This method is used for feature extraction and dimension reduction. Let X k ij denote the jth (1 ≤ j ≤ n i ) training sample (tensor) in the ith (1 ≤ i ≤ c) class and k be the kth extracted feature in the training procedure. There is a total of n = c i=1 n i training samples. We denote the mean tensor of the ith class in the kth iteration by M k i = 1 The jth projection vector in the kth iteration is defined by u j k . In TR1DA the kth feature extraction iteration is defined by In the TR1DA case, there is no close form solution for the optimal projection vectors u d k | M d=1 . Thus, alternate projection methods are applied in order to overcome this issue. In fact, after getting the projection vectors, every tensor can be transformed into a vector in the feature space spanned by u l k | 1≤l≤M 1≤k≤R ; the corresponding value for the kth dimension is given by where Therefore, each tensor is mapped into an R-dimensional vector λ 1 , λ 2 , . . . , λ R . Once extracting feature from the tensor data is completed, the classification in the feature space becomes a much simpler task.
The benefits of TR1DA include [14] (1) a natural way of representing data without losing the structure information, (2) the small sample size problem is easier to solve as through the conventional discriminant learning the number of training samples becomes much smaller than the dimension of the feature space, (3) a better convergence during the training procedure, and (4) The (C-1) lowrank problem is partially solved using a strategy similar to the one used for the complementary space LDA approach which searches for the next projection vector in the subspace orthogonal to the space spanned by the projection vectors [23,24].
The main limitations of the tensor rank one discriminant analysis are (1) the heteroscedastic problem, (2) the problem of the unreasonable between-class scatter matrix, and (3) the problem of data distribution between classes not being considered.
7 Generalized tensor rank one discriminant analysis Similarly, we define GTR1DA by replacing x and c by X and M, respectively: Here, M k is the mean tensor of all the means of each class M k i . k is the kth projection tensor calculation iteration: Here, M j 1 j 2 represents the mean of X j 1 and X j 2 . Adding G to the TR1DA equation, the target function of GTR1DA becomes u l k is the l mode projection vector of the projection tensor k. Since the above target function has no close form solution, an alternative projection method is adopted, and G becomes: Replacing the equation in the bracket yields Similarly, the target function of TR1DA can also be rewritten as The optimization problem of GTR1DA is transformed into an m-subproblem, where m is the mode count of the original data: It is common that features in the feature space display a scattered distribution. The above method, which calculates pairs of points from pairs of classes, considers the spatial distribution of each class. These additional information renders it better than the LDA approach where only the center of each class is considered. On the other hand, a better classification plane can be determined if more closing point pairs are used. Thus, a natural extension to the above method is to assign a weight to each point depending on its position: Adding a weight to the above equation leads to (38) http://asp.eurasipjournals.com/content/2014/1/2 The simplest way of determining a weight is to use the inverse of the distance or by combining the two previous approaches: Here, the method is similar to the one used to extend the classical LDA (Equations 18 and 19).
The main idea behind this geometrical approach is to fully consider the distribution of the data points of the different classes when they are close from one another while excluding their influence when they are far from the classification plane.
Algorithm 1 describes the GTR1DA procedure.

Algorithm 1 Generalized tensor rank one discriminant analysis
Require: Input: Training tensors X ij , 1 ≤ c, 1 ≤ j ≤ n i , the number R of rank one tensors allowed in GTR1DA Output: The projection vectors u l k , 1 ≤ l ≤ M, for t = 1 to L do 4: for l = 1 to M do 5: update X k i,j according to Equations 27 and 28 6: calculate the class mean tensor M k

Justification of generalized tensor rank one discriminant analysis
As mentioned earlier (Section 4), the original LDA method has some major drawbacks since the small-size problem, the (C-1) low-rank problem, the heteroscedastic problem, and the unreasonable between-class scatter matrix need to be solved. Being an extension of LDA, TR1DA presents the same defects. Our new approach brings a new light on the matter as it does not suffer from these downsides. The most important characteristic of our approach is that it considers the distribution characteristics of the original data and of the adjacent points over the different classes. Another important aspect of our algorithm is the convergence. In our approach, we adopt an alternative projection method. In the end, it can be proved that our algorithm is monotonic and that the target function for each iteration is monotonically decreasing. We define the target function using the mode and the iteration numbers: where k is the kth projection tensor and l is the mode number. Our algorithm generates a sequence of target function values for each mode l and projection tensor count k. The sequence is as follows: The alternating projection algorithm is actually a composition of M subalgorithms. To check the convergence at each step and know whether or not to stop the algorithm, we calculate the value of the following equation and compare it to a given threshold: We use this method to determine whether the algorithm converges and to terminate the entire algorithm.

Optimal calculation of tensor learning approaches
In general, tensor learning algorithms use alternative optimization algorithms to calculate the result. The required number of iterations tends to be largely determined by the choice of the initial value. Therefore, we proposed an initial value selection algorithm to improve the computational efficiency of the tensor learning algorithm.
Since the original optimization problem is non-convex, it is possible to fall into the non-optimal local solution. Choosing a good initial value greatly affects the convergence property and hence the final result too. The general tensor method often assigns a randomly generated value or 1 to the initial value. This approach is often far from satisfactory. Since the goal is to compute an optimal solution in the rank 1 tensor space, as close as possible to the optimal solution of the original problem, we choose the initial value to be the closest rank 1 tensor to the optimal solution of the original vector space problem: (a (1) , . . . , a (N) Here, the method for this problem is alternating least squares. The basic idea behind this algorithm is similar to the one used in supervised tensor learning. When calculating the value of a certain mode, we fix the other modes. A problem of this form is convex and therefore easy to solve. The equation is as follows [25][26][27]: Expanding the equation yields = min a (n) Z (n) − a (n) a (N) ⊗ · · · ⊗ a (n−1) ⊗ × a (n+1) ⊗ · · · ⊗ a (1) T 2 . (48) ⊗ stands for the Kronecker product and Z (n) means transforming a tensor Z into a matrix corresponding to mode n. The solution to this problem is

Classification
We take the popular and high-performance SVM [28] as our classifier. SVM tries to find the decision boundary that gives the smallest generalization error by maximizing the margin. Consider a classification task for two classes: x 1 , x 2 , . . . , x N are N samples in the training set, and t 1 , t 2 , . . . , t N are the corresponding label, where t n ∈ {−1, +1}. As the data is not always separable linearly, the soft margin method is proposed as a modified maximum margin idea that allows for mislabeled examples, i.e., it allows some of the training set data points to be misclassified. The goal now is to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. This forms the following primal optimization problem: where the parameter C > 0 controls the trade-off between the slack variable penalty and the margin.The corresponding Lagrangian is given by where {a n ≥ 0} and {μ n ≥ 0} are Lagrange multipliers, and its dual Lagrangian is in the form n=1 a n a m t n t m k(x n , x m ) (52) with the constraints 0 ≤ a n ≤ C and N n=1 a n t n = 0, and k(x, x ) = φ(x) T φ(x ) is the kernel function. This is known as the kernel trick. It represents the inner product in the feature space. Kernels allow us to use feature spaces of high, even infinite, dimensionality. Polynomial kernel and Gaussian kernel are commonly used kernel functions. In our work, we select Gaussian kernel as the kernel function for its better performance.
The support vector machine is fundamentally a twoclass classifier. In practice, however, there are various categories of heart diseases, so we have to tackle problems involving K > 2 classes of ECG signals. Various methods have been proposed to build a multiclass classifier from a set of two-class SVMs. Some commonly used approaches include 'one-versus-the-rest', 'one-versus-one', DAGSVM, and binary tree. Also, we take the one-versus-one strategy in our approach. Hsu and Lin [29] give a detailed comparison demonstrating that one-versus-one is a competitive strategy. In this approach, totally K(K − 1)/2 different 2-class SVMs on all possible pairs of classes are trained. The test points are then classified to the class that has the highest number of 'votes'.

12-Lead ECG database
To evaluate the performance of our method, we test it on a large dataset provided by a local hospital, courtesy of the SiWei medical company (Shanghai, China) and SiWei Remote ECG diagnostic center. This data is the real diagnostic data generated by a regular medical diagnostic system. The entire database accumulated by the SiWei Remote ECG diagnostic center covers 3 years and consists http://asp.eurasipjournals.com/content/2014/1/2 of 98,287 pieces of ECG data. Each piece is approximately a 20-s long diagnosis; the sampling rate is 500 Hz. The data is a standard 12-lead signal which includes channels I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, and V6. I, II, and III are limb leads. aVR, aVL, and aVF are augmented limb leads, while V1, V2, V3, V4, V5, and V6 are chest leads. The database consists of 1,251 kinds of single or mixed diseases. There are 249 single-disease categories. The dataset used in our experiment is a subset of the whole database. It consists of 3,000 pieces of highquality 12-lead ECG records. Each piece includes 10 to 25 beats, for a total of 65,716 beats. These records originate from a wide variety of people: men, women, young, old, healthy, and unhealthy. The doctors' diagnoses are taken as label for the beats; it is one of the following six types: Normal beat (N), left bundle branch block beat (L), right bundle branch block beat (R), left ventricular hypertrophy (V), sinus bradycardia (S), and electrical axis left side (E). Once the raw ECG signal has been preprocessed, we get the following single heartbeat segments: 19,400 N type; 7,056 L type; 10,080 R type; 6,720 V type; 14,540 S type; and 7,920 E type. a We then split the dataset into two categories: the training set and the test set. We use the former to generate the projection tensors and train the SVM model. The models are then used for the classification of the testing dataset. We randomly split the original dataset into two subsets: the training set consists of 20,000 beats, and the 45,716 remaining beats are used for testing. The ratio are 1:3 and 2:3 of the total data for the training and testing sets, respectively. The size of the training set and of the testing set for each specific type are listed in Table 1. Figure 2 presents an example of a 12-lead ECG signal found in our database. In the experiment, the set is randomly split; the listed result is an average over several experiments.

Results on dataset
We first compare the convergence of TR1DA to our extended version GTR1DA. The result is shown in Figure 3a. Note that the pattern shape of the two methods look alike; the same observation also applies to the shape of the curves. For the feature of the 5th, 8th, 12th, and 19th dimension, both GTR1DA and TR1DA need more iteration steps. However, by comparing the two methods, it is obvious that GTR1DA needs less steps to convergence than the original TR1DA method. Overall, the number of iterations required by our new method is about 1:3 of the original TR1DA. In our experiments, the iteration count threshold was fixed to 30. Therefore, when the iteration count reaches 30, in the TR1DA case, it may indicate that the algorithm did not converge. In fact, during the experiment, if TR1DA does not converge within 30 steps, it iterates with a small gap and, in most cases, cannot converge even with more steps.
On Figure 3b, the Frobenius norm of the difference of a two-step projection tensor is shown. GTR1DA appears to converge very fast and very smoothly, without any fluctuations or sudden increase in the process. It remains substantially and monotonically decreasing; on the contrary, TR1DA starts fluctuating after a few steps. Despite this behavior, it sometimes achieves good convergence, although it always requires more iteration steps than GTR1DA. It can also happen that TR1DA fails to converge after 30 iterations while only fluctuating in a very small range. The number following the method name in the diagram represents the total number of iterations. Figure 3c shows a statistics of the iteration count of GTR1DA and TR1DA. GTR1DA displays an evident advantage over TR1DA.
In addition, since both TR1DA and GTR1DA are tensor learning approaches, the selection of a proper initial value has a great impact on the convergence of the algorithm. Figure 4 shows the improvement brought by our optimal initial value selection method. It compares the number of iterations when run with a random initial value and with our optimal initial value. The difference in the final result is quite large as in the random case. The number of iterations fluctuates between four and seven, against three and four using our optimal initial value. Figure 4 shows how important a proper choice for the initial value is. Indeed, it not only improves the convergence speed but also decreases the number of iterations, leading to a much higher computational efficiency. The iteration step count is more stable: fluctuating less and displaying only small variations. Figure 5 shows the time cost of calculating the initial value. Actually, a good initial value sometimes leads to a better solution. Here, we compare the target function value of the calculated initial value and the random initial value in Figure 6. Sometimes, the calculated initial value leads to a better solution. In GTR1DA, we consider the closing points of different classes to calculate the optimal projection tensor. In fact, choosing an optimal proportion requires a trade-off between efficiency and effectiveness. A lower proportion leads to a higher efficiency as in Figure 7, but at the same time, the effectiveness follows a curve which depends on the point proportion. Therefore, using either a too-high or a too-low proportion can lower the classification accuracy. Table 2 compares the classification accuracy for different point proportions. It lists each class accuracy and average accuracy for the following m% values in Equation 19: 0.5%, 1.2%, 1.5%, 5.5%, and 8%. To prevent the overlapping point pairs, the n% is set with 0.2%. Through this comparison, we observe that the 1.2% proportion results in the highest classification accuracy.
On the other hand, the proportion can greatly influence the time cost. Figure 7 compares the time cost of one loop for the proportions 0.5%, 1.2%, 1.5%, 5.5%, and 8%. It is clear on the figure that a higher proportion implies a  higher time cost and greater fluctuations and variations. One loop using 8% takes from 1,200 to 1,500 s, while with 5.5%, it is between 800 and 1,000 s. It even drops down to 120, 50, and 20 s for 1.5% 1.2%, and 0.5%, respectively. As the fluctuations in the latter two cases remain small, we need to reduce the proportion of the points as much as possible in order to improve the computational efficiency while maintaining a high classification accuracy. Figure 8 shows the convergence process with respect to the time cost. Figure 9 represents the statistics of a onefeature calculation time. In this case, the closing point proportion is 1.2%. Also, note that the time cost remains acceptable compared to that in TR1DA.
As for the ECG signal processing, the most important is the classification accuracy. Therefore, we compared the following different methods over the dimensions 1 to 20 in Figure 10: GTR1DA, TR1DA, UMPCA, PCA, ICA, LDA, and regularized LDA. As for rLDA, PCA, ICA, and LDA, the tensor data is first expanded into a vector, and then the projection vectors are calculated based on the vectorized data. Finally, the top 20 projection vectors are selected to calculate the features exploited in the classification. Figure 10 highlights the superiority of the tensor-based approach over the vector space-based methods. The main reason is that tensor-based approaches take the tensor  data directly as input, preventing SSS problem [7]. Moreover, it enables the utilization of the structural information in the tensor data. The original TR1DA displays a higher classification accuracy under low dimension conditions; however, for higher dimensions, GTR1DA outperforms TR1DA. Both perform better than UMPCA though. As for the vector space methods, from the best to the worst, they are classified as follows for the lower dimension case: LDA, rLDA, PCA, and ICA. They are classified as follows for higher dimensions: PCA, rLDA, ICA, and LDA. ICA displays a sudden accuracy increase after dimension 15. Table 3 presents a clear comparison of the different methods depending on the classification accuracy of each class. It is obvious that TR1DA and GTR1DA have higher accuracy and outperform the others. In addition, our method GTR1DA performs better than the original TR1DA. Four classes of ECG have a 100% accuracy, so it is obvious that our extension impacts the experiments. To further illustrate the effectiveness of our approach and show a clear illustration of better class separation, we extract three dimension features of three classes using some feature extraction methods and plot their 3D distribution ( Figure 11).
In Figure 11, it is clear that GTR1DA, TR1DA, and LDA display cluster characteristics. This is to be contrasted by other methods such as PCA, ICA, and UMPCA. Both LDA and TR1DA seem to display a better cluster characteristic; however, there is an obvious overlap at the boundary of each two classes. Our method calculates the projection tensors based on the data distribution nearing the boundary of each two classes. That explains why we achieve an almost 100% accuracy. Hence, in the end, GTR1DA features a better classification accuracy than the other two methods.
To analyze the performance of our approach, we compare the classification accuracy of the wavelet transform approach [1], of the neural networks approach for determining the features of a given ECG signal [2], of the active learning method [3], and of GTR1DA. The results are shown in Table 4. Our method clearly displays an advantage over all the other methods.

Conclusions
To solve the feature extraction, dimension reduction, and classification problems of large 12-lead hospital standard ECG datasets, a new tensor-based scheme was proposed.
To determine the most discriminative feature, the 12lead ECG was first transformed into a tensor in the spatial-spectral-temporal domain using STFT. The primary objective of extending the original TR1DA method was to extract the effective feature using the tensor data as direct inputs to overcome the SSS problem and take advantage of the structure information contained therein. In addition, our extension provides a new method for solving the heteroscedastic problem and the unreasonable http://asp.eurasipjournals.com/content/2014/1/2 between-class scatter matrix problem by considering the distribution of closing points near the boundary of the various classes. To improve the calculation speed and stability, an optimal calculation method for tensor learning approaches was also proposed to compute an optimal initial value. The results of the experiment showed that our new tensor-based scheme outperforms traditional vectorbased methods. In addition, classification is much more accurate when using GTR1DA than when using the original TR1DA and UMPCA methods. The experiment also shows a critical performance improvement caused by a judicious choice of the initial value. This indicates that GTR1DA, and tensor-based schemes in general, are well fitted, effective, and robust data exploration tools for classifying 12-lead ECG signals.
Endnote a We will make the ECG data public. Please see the website http://bcmi.sjtu.edu.cn/ehealth/.