Sliding Window Generalized Kernel Affine Projection Algorithm Using Projection Mappings

Very recently, a solution to the kernel-based online classification problem has been given by the adaptive projected subgradient method (APSM). The developed algorithm can be considered as a generalization of a kernel affine projection algorithm (APA) and the kernel normalized least mean squares (NLMS). Furthermore, sparsification of the resulting kernel series expansion was achieved by imposing a closed ball (convex set) constraint on the norm of the classifiers. This paper presents another sparsification method for the APSM approach to the online classification task by generating a sequence of linear subspaces in a reproducing kernel Hilbert space (RKHS). To cope with the inherent memory limitations of online systems and to embed tracking capabilities to the design, an upper bound on the dimension of the linear subspaces is imposed. The underlying principle of the design is the notion of projection mappings. Classification is performed by metric projection mappings, sparsification is achieved by orthogonal projections, while the online system’s memory requirements and tracking are attained by oblique projections. The resulting sparsification scheme shows strong similarities with the classical sliding window adaptive schemes. The proposed design is validated by the adaptive equalization problem of a nonlinear communication channel, and is compared with classical and recent stochastic gradient descent techniques, as well as with the APSM’s solution where sparsification is performed by a closed ball constraint on the norm of the classifiers.


INTRODUCTION
Kernel methods play a central role in modern classification and nonlinear regression tasks and they can be viewed as the nonlinear counterparts of linear supervised and unsupervised learning algorithms [1][2][3]. They are used in a wide variety of applications from pattern analysis [1][2][3], equalization or identification in communication systems [4,5], to time series analysis and probability density estimation [6][7][8].
A positive-definite kernel function defines a high- or even infinite-dimensional reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, widely called feature space [1-3, 9, 10]. It also gives a way to map data, collected from the Euclidean data space, to the feature space $\mathcal{H}$. In such a way, processing is transferred to the high-dimensional feature space, and the classification task in $\mathcal{H}$ is expected to be linearly separable according to Cover's theorem [1]. The inner product in $\mathcal{H}$ is given by a simple evaluation of the kernel function on the data space, while the explicit knowledge of the feature space $\mathcal{H}$ is unnecessary. This is well known as the kernel trick [1][2][3].
We will focus on the two-class classification task, where the goal is to classify an unknown feature vector $x$ to one of the two classes, based on the classifier value $f(x)$. The online setting will be considered here, where data arrive sequentially. If these data are represented by the sequence $(x_n)_{n \geq 0} \subset \mathbb{R}^m$, where $m$ is a positive integer, then the objective of online kernel methods is to form an estimate of $f$ in $\mathcal{H}$ given by a kernel series expansion:
$$f := \sum_{n=0}^{\infty} \gamma_n \kappa(x_n, \cdot), \tag{1}$$
where $\kappa$ stands for the kernel function, $(x_n)_{n \geq 0}$ parameterizes the kernel function, $(\gamma_n)_{n \geq 0} \subset \mathbb{R}$, and we assume, of course, that the right-hand side of (1) converges.
A convex analytic viewpoint of the online classification task in an RKHS was given in [11]. The standard classification problem was viewed as the problem of finding a point in a closed half-space (a special closed convex set) of $\mathcal{H}$. Since data arrive sequentially in an online setting, online classification was considered as the task of finding a point in the nonempty intersection of an infinite sequence of closed half-spaces. A solution to such a problem was given by the recently developed adaptive projected subgradient method (APSM), a convex analytic tool for the convexly constrained asymptotic minimization of an infinite sequence of nonnegative convex, but not necessarily differentiable, objectives in real Hilbert spaces [12][13][14]. It was discovered that many projection-based adaptive filtering [15] algorithms like the classical normalized least mean squares (NLMS) [16,17], the more recently explored affine projection algorithm (APA) [18,19], as well as more recently developed algorithms [20][21][22][23][24][25][26][27][28] become special cases of the APSM [13,14]. In the same fashion, the present algorithm can be viewed as a generalization of a kernel affine projection algorithm.
To form the functional representation in (1), the coefficients $(\gamma_n)_{n \geq 0}$ must be kept in memory. Since the number of incoming data increases, the memory requirements as well as the necessary computations of the system increase linearly with time [29], leading to a conflict with the limitations and complexity issues as posed by any online setting [29,30]. Recent research focuses on sparsification techniques, that is, on introducing criteria that lead to an approximate representation of (1) using a finite subset of $(\gamma_n)_{n \geq 0}$. This is equivalent to identifying those kernel functions whose removal is expected to have a negligible effect, in some predefined sense, or, equivalently, building dictionaries out of the sequence $(\kappa(x_n, \cdot))_{n \geq 0}$ [31][32][33][34][35][36].
To introduce sparsification, the design in [30], apart from the sequence of closed half-spaces, imposes an additional constraint on the norm of the classifier. This leads to a sparsified representation of the expansion of the solution given in (1), with an effect similar to that of a forgetting factor which is used in recursive least squares (RLS) type algorithms [15].
This paper follows a different path to sparsification, in line with the rationale adopted in [36]. A sequence of linear subspaces $(M_n)_{n \geq 0}$ of $\mathcal{H}$ is formed by using the incoming data together with an approximate linear dependency/independency criterion. To satisfy the memory requirements of the online system, and in order to provide our design with tracking capabilities, a bound on the dimension of the generated subspaces $(M_n)_{n \geq 0}$ is imposed. This upper bound turns out to be equivalent to the length of a memory buffer. Once the buffer is full, each time a new datum enters the system, an old observation is discarded. Hence, an upper bound on the dimension results in a sliding window effect. The underlying principle of the proposed design is the notion of projection mappings. Indeed, classification is performed by metric projection mappings, sparsification is conducted by orthogonal projections onto the generated linear subspaces $(M_n)_{n \geq 0}$, and memory limitations (which lead to enhanced tracking capabilities) are established by employing oblique projections. Note that although the classification problem is considered here, the tools can readily be adopted for regression tasks, with different cost functions that can be either differentiable or nondifferentiable.
The paper is organized as follows. Mathematical preliminaries and elementary facts on projection mappings are given in Section 2. A short description of the convex analytic perspective introduced in [11,30] is presented in Sections 3 and 4, respectively. A byproduct of this approach, a kernel affine projection algorithm (APA), is introduced in Section 4.2. The sparsification procedure based on the generation of a sequence of linear subspaces is given in Section 5. To validate the design, the adaptive equalization problem of a nonlinear channel is chosen. We compare the present scheme with the classical kernel perceptron algorithm, its generalization, the NORMA method [29], as well as the APSM's solution but with the norm constraint sparsification [30] in Section 7. In Section 8, we conclude our discussion, and several clarifications as well as a table of the main symbols, used in the paper, are gathered in the appendices.

MATHEMATICAL PRELIMINARIES
Henceforth, the set of all integers, nonnegative integers, positive integers, real and complex numbers will be denoted by $\mathbb{Z}$, $\mathbb{Z}_{\geq 0}$, $\mathbb{Z}_{>0}$, $\mathbb{R}$, and $\mathbb{C}$, respectively. Moreover, the symbol card($J$) will stand for the cardinality of a set $J$, and $\overline{j_1, j_2} := \{j_1, j_1 + 1, \ldots, j_2\}$, for any integers $j_1 \leq j_2$.

Reproducing kernel Hilbert space
We provide here a few elementary facts about reproducing kernel Hilbert spaces (RKHS). The symbol $\mathcal{H}$ will stand for an infinite-dimensional, in general, real Hilbert space [37,38] equipped with an inner product denoted by $\langle \cdot, \cdot \rangle$. The induced norm in $\mathcal{H}$ will be given by $\|f\| := \langle f, f \rangle^{1/2}$, for all $f \in \mathcal{H}$. An example of a finite-dimensional real Hilbert space is the well-known Euclidean space $\mathbb{R}^m$ of dimension $m \in \mathbb{Z}_{>0}$. In this space, the inner product is nothing but the vector dot product $\langle x_1, x_2 \rangle := x_1^t x_2$, for all $x_1, x_2 \in \mathbb{R}^m$, where the superscript $(\cdot)^t$ stands for vector transposition.
Assume a real Hilbert space $\mathcal{H}$ which consists of functions defined on $\mathbb{R}^m$, that is, $f : \mathbb{R}^m \to \mathbb{R}$. Assume also that there exists a function $\kappa(\cdot, \cdot) : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ such that (1) $\kappa(x, \cdot)$ belongs to $\mathcal{H}$, for all $x \in \mathbb{R}^m$, and (2) the reproducing property holds, that is,
$$f(x) = \langle f, \kappa(x, \cdot) \rangle, \quad \forall x \in \mathbb{R}^m, \ \forall f \in \mathcal{H}. \tag{2}$$
In this case, $\mathcal{H}$ is called a reproducing kernel Hilbert space (RKHS) [2,3,9]. If such a function $\kappa(\cdot, \cdot)$ exists, it is unique [9]. A reproducing kernel is positive definite and symmetric in its arguments [9]. (A kernel $\kappa$ is called positive definite if $\sum_{l,j=1}^{N} \xi_l \xi_j \kappa(x_l, x_j) \geq 0$, for all $\xi_l, \xi_j \in \mathbb{R}$, for all $x_l, x_j \in \mathbb{R}^m$, and for any $N \in \mathbb{Z}_{>0}$ [9]. This property underlies the kernel functions first studied by Mercer [10].) In addition, the Moore-Aronszajn theorem [9] guarantees that to every positive definite function $\kappa(\cdot, \cdot) : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ there corresponds a unique RKHS $\mathcal{H}$ whose reproducing kernel is $\kappa$ itself [9]. Such an RKHS is generated by taking first the space of all finite combinations $\sum_j \gamma_j \kappa(x_j, \cdot)$, where $\gamma_j \in \mathbb{R}$, $x_j \in \mathbb{R}^m$, and then completing this space by considering also all its limit points [9]. Notice here that, by (2), the inner product of $\mathcal{H}$ is realized by a simple evaluation of the kernel function, which is well known as the kernel trick [1,2]:
$$\langle \kappa(x, \cdot), \kappa(y, \cdot) \rangle = \kappa(x, y), \quad \forall x, y \in \mathbb{R}^m.$$
There are numerous kernel functions and associated RKHS $\mathcal{H}$, which have extensively been used in pattern analysis and nonlinear regression tasks [1][2][3]. Celebrated examples are (i) the linear kernel $\kappa(x, y) := x^t y$, for all $x, y \in \mathbb{R}^m$ (here the RKHS $\mathcal{H}$ is the data space $\mathbb{R}^m$ itself), and (ii) the Gaussian or radial basis function (RBF) kernel $\kappa(x, y) := \exp(-(x - y)^t (x - y)/(2\sigma^2))$, for all $x, y \in \mathbb{R}^m$, where $\sigma > 0$ (here the associated RKHS is of infinite dimension [2,3]).
For more examples and systematic ways of generating more involved kernel functions by using fundamental ones, the reader is referred to [2,3]. Hence, an RKHS offers a unifying framework for treating several types of nonlinearities in classification and regression tasks.
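The two kernels named above, and the positive definiteness condition, can be illustrated with a minimal sketch. The function names (`linear_kernel`, `gaussian_kernel`, `gram_matrix`, `quadratic_form`) and the sample points are hypothetical choices made for this illustration only.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: kappa(x, y) = x^t y."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Matrix of kernel evaluations kappa(x_l, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(gram, xi):
    """sum_{l,j} xi_l xi_j kappa(x_l, x_j), nonnegative for a positive definite kernel."""
    n = len(xi)
    return sum(xi[l] * xi[j] * gram[l][j] for l in range(n) for j in range(n))

# Illustrative data (assumed, not from the paper).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, gaussian_kernel)
value = quadratic_form(K, [1.0, -2.0, 0.5])
```

By the kernel trick, each entry of `K` equals the inner product of the corresponding feature-space elements, so no explicit feature map is ever formed.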

Closed convex sets, metric, orthogonal, and oblique projection mappings
A subset $C$ of $\mathcal{H}$ will be called convex if for all $f_1, f_2 \in C$ and for all $\lambda \in (0, 1)$ we have $\lambda f_1 + (1 - \lambda) f_2 \in C$. Given any point $f \in \mathcal{H}$, we can quantify its distance from a nonempty closed convex set $C$ by the metric distance function $d(f, C) := \inf\{\|f - f'\| : f' \in C\}$ [37,38], where inf denotes the infimum. The function $d(\cdot, C)$ is nonnegative, continuous, and convex [37,38]. Note that any point $f \in C$ is of zero distance from $C$, that is, $d(f, C) = 0$, and that the set of all minimizers of $d(\cdot, C)$ over $\mathcal{H}$ is $C$ itself.
Given a point $f \in \mathcal{H}$ and a closed convex set $C \subset \mathcal{H}$, an efficient way to move from $f$ to a point in $C$, that is, to a minimizer of $d(\cdot, C)$, is by means of the metric projection mapping $P_C$ onto $C$, which is defined as the mapping that takes $f$ to the uniquely existing point $P_C(f)$ of $C$ that achieves the infimum value $\|f - P_C(f)\| = d(f, C)$ [37,38]. For a geometric interpretation refer to Figure 1. Clearly, if $f \in C$ then $P_C(f) = f$. A well-known example of a closed convex set is a closed linear subspace $M$ [37,38] of a real Hilbert space $\mathcal{H}$. The metric projection mapping $P_M$ is now called orthogonal projection since the following property holds: $\langle f - P_M(f), f' \rangle = 0$, for all $f' \in M$ and for all $f \in \mathcal{H}$ [37,38]. Given an $f \in \mathcal{H}$, the shift of a closed linear subspace $M$ by $f$, that is, $V := f + M := \{f + f' : f' \in M\}$, is called an (affine) linear variety [38].
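Given $a \neq 0$ in $\mathcal{H}$ and $\xi \in \mathbb{R}$, let a closed half-space be the closed convex set $\Pi^+ := \{f \in \mathcal{H} : \langle a, f \rangle \geq \xi\}$, that is, $\Pi^+$ is the set of all points that lie on the "positive" side of the hyperplane $\Pi := \{f \in \mathcal{H} : \langle a, f \rangle = \xi\}$, which defines the boundary of $\Pi^+$ [37]. The vector $a$ is usually called the normal vector of $\Pi^+$. The metric projection operator $P_{\Pi^+}$ can easily be obtained by simple geometric arguments, and it is shown to have the closed-form expression [37,39]:
$$P_{\Pi^+}(f) = f + \frac{(\xi - \langle a, f \rangle)^+}{\|a\|^2}\, a, \quad \forall f \in \mathcal{H}, \tag{3}$$
where $\tau^+ := \max\{0, \tau\}$ denotes the positive part of a $\tau \in \mathbb{R}$.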
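As a small illustration of the closed-form half-space projection, the following sketch evaluates it in the Euclidean space $\mathbb{R}^m$ (the linear-kernel case), where the inner product is the dot product. The names `dot` and `project_halfspace` are hypothetical and chosen for this example.

```python
def dot(a, b):
    """Euclidean inner product <a, b>."""
    return sum(x * y for x, y in zip(a, b))

def project_halfspace(f, a, xi):
    """Metric projection of f onto {g : <a, g> >= xi}, assuming a != 0.

    Implements f + ((xi - <a, f>)^+ / ||a||^2) a, where t^+ = max(0, t).
    """
    gap = max(0.0, xi - dot(a, f))   # positive part: zero when f is already feasible
    scale = gap / dot(a, a)
    return [fi + scale * ai for fi, ai in zip(f, a)]

outside = project_halfspace([0.0, 0.0], [1.0, 0.0], 2.0)  # moved onto the boundary
inside = project_halfspace([3.0, 1.0], [1.0, 0.0], 2.0)   # left unchanged
```

A point that already satisfies the inequality is returned unchanged, while any other point lands on the bounding hyperplane.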
Given the center $f_0 \in \mathcal{H}$ and the radius $\delta > 0$, we define the closed ball $B[f_0, \delta] := \{f \in \mathcal{H} : \|f_0 - f\| \leq \delta\}$ [37]. The closed ball $B[f_0, \delta]$ is clearly a closed convex set, and its metric projection mapping is given by the simple formula:
$$P_{B[f_0, \delta]}(f) = \begin{cases} f, & \text{if } \|f - f_0\| \leq \delta, \\ f_0 + \dfrac{\delta}{\|f - f_0\|}(f - f_0), & \text{if } \|f - f_0\| > \delta, \end{cases} \tag{4}$$
for all $f \in \mathcal{H}$, which is the point of intersection of the sphere and the segment joining $f$ and the center of the sphere in the case where $f \notin B[f_0, \delta]$ (see Figure 1).
Let, now, $M$ and $M'$ be linear subspaces of a finite-dimensional linear subspace $V \subset \mathcal{H}$. Then, let $M + M'$ be defined as the subspace $M + M' := \{h + h' : h \in M, h' \in M'\}$. If also $M \cap M' = \{0\}$, then $M + M'$ is called the direct sum of $M$ and $M'$ and is denoted by $M \oplus M'$ [40,41]. In the case where $V = M \oplus M'$, every $f \in V$ can be expressed uniquely as a sum $f = h + h'$, where $h \in M$ and $h' \in M'$ [40,41]. Then, we define here a mapping $P_{M,M'} : V = M \oplus M' \to M$ which takes any $f \in V$ to that unique $h \in M$ that appears in the decomposition $f = h + h'$. We will call $h$ the (oblique) projection of $f$ on $M$ along $M'$ [40] (see Figure 1).
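The ball projection and the oblique projection described above can be sketched in the plane $\mathbb{R}^2$. This is an illustrative example only: the function names are hypothetical, and the oblique projection is restricted to two one-dimensional subspaces spanned by the vectors `m` and `m_prime`.

```python
import math

def project_ball(f, center, delta):
    """Metric projection of f onto the closed ball with the given center and radius."""
    diff = [fi - ci for fi, ci in zip(f, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= delta:
        return list(f)               # f already lies in the ball
    return [ci + delta * d / dist for ci, d in zip(center, diff)]

def oblique_projection_2d(f, m, m_prime):
    """Projection of f on span{m} along span{m_prime} in R^2.

    Solves f = alpha * m + beta * m_prime by Cramer's rule and returns alpha * m.
    """
    det = m[0] * m_prime[1] - m[1] * m_prime[0]
    if det == 0:
        raise ValueError("span{m} and span{m_prime} must form a direct sum")
    alpha = (f[0] * m_prime[1] - f[1] * m_prime[0]) / det
    return [alpha * m[0], alpha * m[1]]

on_sphere = project_ball([3.0, 4.0], [0.0, 0.0], 1.0)
oblique = oblique_projection_2d([3.0, 2.0], [1.0, 0.0], [1.0, 1.0])
```

For the sample point, the orthogonal projection onto the horizontal axis would be $(3, 0)$, whereas the oblique projection along the direction $(1, 1)$ gives $(1, 0)$, which shows how the choice of $M'$ changes the result.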

CONVEX ANALYTIC VIEWPOINT OF KERNEL-BASED CLASSIFICATION
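In pattern analysis [1,2], data are usually given by a sequence of vectors $(x_n)_{n \in \mathbb{Z}_{\geq 0}} \subset \mathcal{X} \subset \mathbb{R}^m$, for some $m \in \mathbb{Z}_{>0}$. We will assume that each vector in $\mathcal{X}$ is drawn from two classes and is thus associated to a label $y_n \in \mathcal{Y} := \{\pm 1\}$, $n \in \mathbb{Z}_{\geq 0}$. As such, a sequence of (training) pairs $\mathcal{D} := ((x_n, y_n))_{n \in \mathbb{Z}_{\geq 0}} \subset \mathcal{X} \times \mathcal{Y}$ is formed.
To benefit from a larger than $m$ or even infinite-dimensional space, modern pattern analysis reformulates the classification problem in an RKHS $\mathcal{H}$ (implicitly defined by a predefined kernel function $\kappa$), which is widely known as the feature space [1][2][3]. A mapping $\phi : \mathbb{R}^m \to \mathcal{H}$ which takes $(x_n)_{n \in \mathbb{Z}_{\geq 0}} \subset \mathbb{R}^m$ onto $(\phi(x_n))_{n \in \mathbb{Z}_{\geq 0}} \subset \mathcal{H}$ is given by the kernel function associated to the RKHS feature space $\mathcal{H}$: $\phi(x) := \kappa(x, \cdot) \in \mathcal{H}$, for all $x \in \mathbb{R}^m$. Then, the classification problem is defined in the feature space $\mathcal{H}$ as selecting a point $f \in \mathcal{H}$ and an offset $b \in \mathbb{R}$ such that $y(f(x) + b) \geq \rho$, for all $(x, y) \in \mathcal{D}$, and for some margin $\rho \geq 0$ [1,2].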
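For convenience, we merge $f \in \mathcal{H}$ and $b \in \mathbb{R}$ into a single vector $u := (f, b) \in \mathcal{H} \times \mathbb{R}$, where $\mathcal{H} \times \mathbb{R}$ stands for the product space [37,38] of $\mathcal{H}$ and $\mathbb{R}$. Henceforth, we will call a point $u \in \mathcal{H} \times \mathbb{R}$ a classifier, and $\mathcal{H} \times \mathbb{R}$ the space of all classifiers. The space $\mathcal{H} \times \mathbb{R}$ of all classifiers can be endowed with an inner product as follows: for any $u_1 := (f_1, b_1)$, $u_2 := (f_2, b_2) \in \mathcal{H} \times \mathbb{R}$, let $\langle u_1, u_2 \rangle_{\mathcal{H} \times \mathbb{R}} := \langle f_1, f_2 \rangle_{\mathcal{H}} + b_1 b_2$. The space $\mathcal{H} \times \mathbb{R}$ of all classifiers becomes then a Hilbert space. The notation $\langle \cdot, \cdot \rangle$ will be used for both $\langle \cdot, \cdot \rangle_{\mathcal{H} \times \mathbb{R}}$ and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$.
A standard penalty function to be minimized in classification problems is the soft margin loss function [1,29] defined on the space of all classifiers $\mathcal{H} \times \mathbb{R}$ as follows: given a pair $(x, y) \in \mathcal{D}$ and the margin parameter $\rho \geq 0$,
$$l_{x,y,\rho}(u) := (\rho - y g_{f,b}(x))^+ = \max\{0, \rho - y g_{f,b}(x)\}, \quad \forall u := (f, b) \in \mathcal{H} \times \mathbb{R}, \tag{5}$$
where the function $g_{f,b}$ is defined by
$$g_{f,b}(x) := f(x) + b, \quad \forall x \in \mathbb{R}^m. \tag{6}$$
If the classifier $u := (f, b)$ is such that $y g_{f,b}(x) < \rho$, then this classifier fails to achieve the margin $\rho$ at $(x, y)$ and (5) scores a penalty. In such a case, we say that the classifier committed a margin error. A misclassification occurs at $(x, y)$ if $y g_{f,b}(x) < 0$.
The studies in [11,30] approached the classification task from a convex analytic perspective. By the definition of the classification problem, our goal is to look for classifiers (points in $\mathcal{H} \times \mathbb{R}$) that belong to the set $\Pi^+_{x,y,\rho} := \{(f, b) \in \mathcal{H} \times \mathbb{R} : y(f(x) + b) \geq \rho\}$. Thus, for a given pair $(x, y)$ and a margin $\rho$, by the definition of the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H} \times \mathbb{R}}$, the set of all desirable classifiers (that do not commit a margin error at $(x, y)$) is
$$\Pi^+_{x,y,\rho} = \{u \in \mathcal{H} \times \mathbb{R} : \langle u, a_{x,y} \rangle_{\mathcal{H} \times \mathbb{R}} \geq \rho\}, \tag{7}$$
where $a_{x,y} := (y\kappa(x, \cdot), y) = y(\kappa(x, \cdot), 1) \in \mathcal{H} \times \mathbb{R}$. The vector $(\kappa(x, \cdot), 1) \in \mathcal{H} \times \mathbb{R}$ is an extended (to account for the constant factor $b$) vector that is completely specified by the point $x$ and the adopted kernel function. By (7), we notice that $\Pi^+_{x,y,\rho}$ is a closed half-space of $\mathcal{H} \times \mathbb{R}$ (see Section 2.2). That is, all classifiers that do not commit a margin error at $(x, y)$ belong to the closed half-space $\Pi^+_{x,y,\rho}$ specified by the chosen kernel function.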
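To make the soft margin loss concrete, the sketch below evaluates it for a classifier whose function part is a finite kernel expansion. The representation of $f$ by the lists `centers` and `gammas`, and all function names, are assumptions made for this illustration.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel on tuples of floats."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def classifier_value(centers, gammas, b, x, kernel=gaussian_kernel):
    """g_{f,b}(x) = f(x) + b, with f = sum_i gamma_i * kappa(c_i, .)."""
    return sum(g * kernel(c, x) for c, g in zip(centers, gammas)) + b

def soft_margin_loss(centers, gammas, b, x, y, rho, kernel=gaussian_kernel):
    """Soft margin loss (rho - y * g(x))^+; zero exactly when the margin is achieved."""
    return max(0.0, rho - y * classifier_value(centers, gammas, b, x, kernel))

# Illustrative classifier (assumed values): f = kappa((0,), .), b = 0.
centers, gammas, b = [(0.0,)], [1.0], 0.0
loss_correct = soft_margin_loss(centers, gammas, b, (0.0,), +1, rho=0.5)
loss_wrong = soft_margin_loss(centers, gammas, b, (0.0,), -1, rho=0.5)
```

A zero loss means the classifier lies in the half-space associated with that pair, while a positive loss signals a margin error.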
The following proposition builds the bridge between the standard loss function l x,y,ρ and the closed convex set Π + x,y,ρ .
Starting from this viewpoint, the following section briefly describes a convex analytic tool [11,30] which tackles the online classification task, where a sequence of parameters $(x_n, y_n, \rho_n)_{n \in \mathbb{Z}_{\geq 0}}$, and thus a sequence of closed half-spaces $(\Pi^+_{x_n,y_n,\rho_n})_{n \in \mathbb{Z}_{\geq 0}}$, is assumed.

THE ONLINE KERNEL-BASED CLASSIFICATION TASK AND THE ADAPTIVE PROJECTED SUBGRADIENT METHOD
At every time instant $n \in \mathbb{Z}_{\geq 0}$, a pair $(x_n, y_n) \in \mathcal{D}$ becomes available. If we also assume a nonnegative margin parameter $\rho_n$, then we can define the set of all classifiers that achieve this margin by the closed half-space $\Pi^+_{x_n,y_n,\rho_n} := \{u = (f, b) \in \mathcal{H} \times \mathbb{R} : y_n(f(x_n) + b) \geq \rho_n\}$. Clearly, in an online setting, we deal with a sequence of closed half-spaces $(\Pi^+_{x_n,y_n,\rho_n})_{n \in \mathbb{Z}_{\geq 0}} \subset \mathcal{H} \times \mathbb{R}$ and since each one of them contains the set of all desirable classifiers, our objective is to find a classifier that belongs to or satisfies most of these half-spaces or, more precisely, to find a classifier that belongs to all but a finite number of the $\Pi^+_{x_n,y_n,\rho_n}$, that is, a $u \in \bigcap_{n \geq N_0} \Pi^+_{x_n,y_n,\rho_n} \subset \mathcal{H} \times \mathbb{R}$, for some $N_0 \in \mathbb{Z}_{\geq 0}$. In other words, we look for a classifier in the intersection of these half-spaces.
The studies in [11,30] propose a solution to the above problem by the recently developed adaptive projected subgradient method (APSM) [12][13][14]. The APSM approaches the above problem as an asymptotic minimization of a sequence of not necessarily differentiable nonnegative convex functions over a closed convex set in a real Hilbert space.
Instead of processing a single pair $(x_n, y_n)$ at each $n$, APSM offers the freedom to process concurrently a set $\{(x_j, y_j)\}_{j \in J_n}$, where the index set $J_n \subset \overline{0, n}$ for every $n \in \mathbb{Z}_{\geq 0}$, and where $\overline{j_1, j_2} := \{j_1, j_1 + 1, \ldots, j_2\}$ for any integers $j_1 \leq j_2$. Intuitively, concurrent processing is expected to increase the speed of an algorithm. Indeed, in adaptive filtering [15], it is the motivation behind the leap from NLMS [16,17], where no concurrent processing is available, to the potentially faster APA [18,19].
To keep the discussion simple, we assume that $n \in J_n$, for all $n \in \mathbb{Z}_{\geq 0}$. An example of such an index set $J_n$ is given in (13). In other words, (13) treats the case where at time instant $n$, the pairs $\{(x_j, y_j)\}_{j \in \overline{n-q+1, n}}$, for some $q \in \mathbb{Z}_{>0}$, are considered. This is in line with the basic rationale of the celebrated affine projection algorithm (APA), which has been used extensively in adaptive filtering [15].
Each pair $(x_j, y_j)$, and thus each index $j$, defines a half-space $\Pi^+_{x_j,y_j,\rho_j^{(n)}}$ by (7). In order to point out explicitly the dependence of such a half-space on the index set $J_n$, we slightly modify the notation for $\Pi^+_{x_j,y_j,\rho_j^{(n)}}$ and use $\Pi^+_{j,n}$ for any $j \in J_n$, and for any $n \in \mathbb{Z}_{\geq 0}$. The metric projection mapping $P_{\Pi^+_{j,n}}$ is analytically given by (3). To assign different importance to each one of the projections corresponding to $J_n$, we associate to each half-space, that is, to each $j \in J_n$, a weight $\omega_j^{(n)}$ such that $\omega_j^{(n)} \geq 0$, for all $j \in J_n$, and $\sum_{j \in J_n} \omega_j^{(n)} = 1$, for all $n \in \mathbb{Z}_{\geq 0}$. This is in line with the adaptive filtering literature that tends to assign higher importance to the most recent samples. For the less familiar reader, we point out that if $J_n := \{n\}$, for all $n \in \mathbb{Z}_{\geq 0}$, the algorithm reduces to the NLMS. Regarding the APA, a discussion can be found below.
As is also pointed out in [29,30], the major drawback of online kernel methods is the linear increase of complexity with time. To deal with this problem, it was proposed in [30] to further constrain the norm of the desirable classifiers by a closed ball. To be more precise, one constrains the desirable classifiers in [30] by $K := B[0, \delta] \times \mathbb{R} \subset \mathcal{H} \times \mathbb{R}$, for some predefined $\delta > 0$. As a result, one seeks classifiers that belong to $K \cap (\bigcap_{n \geq N_0} \bigcap_{j \in J_n} \Pi^+_{j,n})$, for some $N_0 \in \mathbb{Z}_{\geq 0}$. By the definition of the closed ball $B[0, \delta]$ in Section 2.2, we easily see that the addition of $K$ imposes a constraint on the norm of $f$ in the vector $u = (f, b)$ by $\|f\| \leq \delta$. The associated metric projection mapping is analytically given by the simple computation $P_K(u) = (P_{B[0,\delta]}(f), b)$, for all $u := (f, b) \in \mathcal{H} \times \mathbb{R}$, where $P_{B[0,\delta]}$ is obtained by (4). It was observed that constraining the norm results in a sequence of classifiers with a fading memory, where old data can be eliminated [30].
For the sake of completeness, we give a summary of the sparsified algorithm proposed in [30].
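Algorithm 1 (see [30]). For any $n \in \mathbb{Z}_{\geq 0}$, consider the index set $J_n \subset \overline{0, n}$, such that $n \in J_n$. An example of $J_n$ can be found in (13). For any $j \in J_n$ and for any $n \in \mathbb{Z}_{\geq 0}$, let the closed half-space $\Pi^+_{j,n} := \{u = (f, b) \in \mathcal{H} \times \mathbb{R} : y_j(f(x_j) + b) \geq \rho_j^{(n)}\}$, and the weight $\omega_j^{(n)} \geq 0$ such that $\sum_{j \in J_n} \omega_j^{(n)} = 1$, for all $n \in \mathbb{Z}_{\geq 0}$. For an arbitrary initial offset $b_0 \in \mathbb{R}$, consider as an initial classifier the point $u_0 := (0, b_0) \in \mathcal{H} \times \mathbb{R}$ and generate the following point (classifier) sequence in $\mathcal{H} \times \mathbb{R}$ by
$$u_{n+1} := P_K\Big(u_n + \mu_n\Big(\sum_{j \in J_n} \omega_j^{(n)} P_{\Pi^+_{j,n}}(u_n) - u_n\Big)\Big), \quad \forall n \in \mathbb{Z}_{\geq 0}, \tag{8a}$$
where the extrapolation coefficient $\mu_n \in [0, 2\mathcal{M}_n]$ with
$$\mathcal{M}_n := \begin{cases} \dfrac{\sum_{j \in J_n} \omega_j^{(n)} \|P_{\Pi^+_{j,n}}(u_n) - u_n\|^2}{\big\|\sum_{j \in J_n} \omega_j^{(n)} P_{\Pi^+_{j,n}}(u_n) - u_n\big\|^2}, & \text{if } u_n \notin \bigcap_{j \in J_n} \Pi^+_{j,n}, \\ 1, & \text{otherwise}. \end{cases} \tag{8b}$$
Due to the convexity of $\|\cdot\|^2$, the parameter $\mathcal{M}_n \geq 1$, for all $n \in \mathbb{Z}_{\geq 0}$, so that $\mu_n$ can take values larger than or equal to 2. The parameters that can be preset by the designer are the concurrency index set $J_n$ and $\mu_n$. The bigger the cardinality of $J_n$, the more closed half-spaces are concurrently processed at the time instant $n$, which results in a potentially increased convergence speed. An example of $J_n$, which will be followed in the numerical examples, can be found in (13). In the same fashion, for extrapolation parameter values $\mu_n$ close to $2\mathcal{M}_n$ ($\mu_n \leq 2\mathcal{M}_n$), increased convergence speed can also be observed (see Figure 6).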
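If we define
$$\beta_j^{(n)} := \omega_j^{(n)} y_j \frac{(\rho_j^{(n)} - y_j g_n(x_j))^+}{1 + \kappa(x_j, x_j)}, \quad \forall j \in J_n, \ \forall n \in \mathbb{Z}_{\geq 0},$$
where $g_n := g_{f_n,b_n}$ by (6), then the algorithmic process (8a) can be written equivalently as follows:
$$(f_{n+1}, b_{n+1}) = \Big(P_{B[0,\delta]}\Big(f_n + \mu_n \sum_{j \in J_n} \beta_j^{(n)} \kappa(x_j, \cdot)\Big),\ b_n + \mu_n \sum_{j \in J_n} \beta_j^{(n)}\Big), \quad \forall n \in \mathbb{Z}_{\geq 0}.$$
The parameter $\mathcal{M}_n$ takes the following form after the proper algebraic manipulations:
$$\mathcal{M}_n = \begin{cases} \dfrac{\sum_{j \in J_n} \omega_j^{(n)} \big((\rho_j^{(n)} - y_j g_n(x_j))^+\big)^2 / (1 + \kappa(x_j, x_j))}{\sum_{i, j \in J_n} \beta_i^{(n)} \beta_j^{(n)} (1 + \kappa(x_i, x_j))}, & \text{if } u_n \notin \bigcap_{j \in J_n} \Pi^+_{j,n}, \\ 1, & \text{otherwise}. \end{cases}$$
As explained in [30], the introduction of the closed ball constraint $B[0, \delta]$ on the norm of the estimates $(f_n)_n$ results in a potential elimination of the coefficients $\gamma_n$ that correspond to time instants close to index 0 in (1), so that a buffer with length $N_b$ can be introduced to keep only the most recent $N_b$ data $(x_l)_{l=n-N_b+1}^{n}$. This introduces sparsification to the design. Since the complexity of all the metric projections in Algorithm 1 is linear, the overall complexity is linear in the number of kernel functions, or, after inserting the buffer with length $N_b$, it is of order $O(N_b)$.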
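A single update of Algorithm 1 can be sketched as follows, assuming that the half-space projections take the closed form (3) with normal vectors $a_{j,n} = y_j(\kappa(x_j, \cdot), 1)$, so that $\|a_{j,n}\|^2 = 1 + \kappa(x_j, x_j)$. This is a minimal sketch and not the authors' implementation: the function names, the list-based representation of $f_n$ by `centers` and `gammas`, the single margin `rho` shared by all processed pairs, and the parameter `mu_factor` (which sets $\mu_n$ as a fraction of the extrapolation bound) are all assumptions, and no buffer of length $N_b$ is modeled.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def evaluate(centers, gammas, b, x, kernel):
    """g_n(x) = f_n(x) + b_n for f_n = sum_i gamma_i * kappa(c_i, .)."""
    return sum(g * kernel(c, x) for c, g in zip(centers, gammas)) + b

def apsm_step(centers, gammas, b, batch, weights, rho, delta,
              kernel=gaussian_kernel, mu_factor=1.0):
    """One weighted-projection update followed by the ball constraint on f.

    batch holds the pairs (x_j, y_j), j in J_n; weights are nonnegative and sum to 1.
    """
    hinge = [max(0.0, rho - y * evaluate(centers, gammas, b, x, kernel))
             for x, y in batch]
    sq_norms = [1.0 + kernel(x, x) for x, _ in batch]        # ||a_j||^2
    beta = [w * y * h / s
            for w, (_, y), h, s in zip(weights, batch, hinge, sq_norms)]
    # Extrapolation bound: ratio of weighted squared distances to the squared
    # norm of the averaged direction; set to 1 when no update direction exists.
    num = sum(w * h * h / s for w, h, s in zip(weights, hinge, sq_norms))
    den = sum(bi * bj * (1.0 + kernel(xi, xj))
              for bi, (xi, _) in zip(beta, batch)
              for bj, (xj, _) in zip(beta, batch))
    bound = num / den if den > 0.0 else 1.0
    mu = mu_factor * bound           # mu_factor in [0, 2] keeps mu in [0, 2 * bound]
    new_centers = list(centers) + [x for x, _ in batch]
    new_gammas = list(gammas) + [mu * bj for bj in beta]
    new_b = b + mu * sum(beta)
    # Projection of f onto B[0, delta]: rescale the coefficients if ||f|| > delta.
    sq = sum(gi * gj * kernel(ci, cj)
             for ci, gi in zip(new_centers, new_gammas)
             for cj, gj in zip(new_centers, new_gammas))
    norm = math.sqrt(max(0.0, sq))
    if norm > delta:
        new_gammas = [g * delta / norm for g in new_gammas]
    return new_centers, new_gammas, new_b

# Illustrative run from u_0 = (0, 0) with one sample (assumed values).
c1, g1, b1 = apsm_step([], [], 0.0, [((0.0,), 1)], [1.0], rho=1.0, delta=10.0)
c2, g2, b2 = apsm_step([], [], 0.0, [((0.0,), 1)], [1.0], rho=1.0, delta=0.25)
```

With a single processed pair and `mu_factor=1.0`, the update lands exactly on the boundary of the corresponding half-space when the ball constraint is inactive. With a small `delta`, the coefficients are shrunk, which mirrors the fading-memory effect described above.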

Computation of the margin levels
We will now discuss in short the dynamic adjustment strategy of the margin parameters, introduced in [11,30].
For simplicity, all the concurrently processed margins are assumed to be equal to each other, that is, $\rho_n := \rho_j^{(n)}$, for all $j \in J_n$, for all $n \in \mathbb{Z}_{\geq 0}$. Of course, more elaborate schemes can be adopted.
Whenever $(\rho_n - y_j g_n(x_j))^+ = 0$, the soft margin loss function $l_{x_j,y_j,\rho_n}$ in (5) attains a global minimum, which means by Proposition 1 that $u_n := (f_n, b_n)$ belongs to $\Pi^+_{j,n}$. In this case, we say that we have feasibility for $j \in J_n$. Otherwise, that is, if $u_n \notin \Pi^+_{j,n}$, infeasibility occurs. To describe such situations, let us denote the feasibility cases by the index set $J'_n := \{j \in J_n : (\rho_n - y_j g_n(x_j))^+ = 0\}$. The infeasibility cases are obviously $J_n \setminus J'_n$.
If we set card($\emptyset$) := 0, then we define the feasibility rate as the quantity $R^{(n)}_{\text{feas}} := \text{card}(J'_n)/\text{card}(J_n)$, for all $n \in \mathbb{Z}_{\geq 0}$. For example, $R^{(n)}_{\text{feas}} = 1/2$ denotes that the number of feasibility cases is equal to the number of infeasibility ones at the time instant $n \in \mathbb{Z}_{\geq 0}$.
If, at time $n$, $R^{(n)}_{\text{feas}}$ is larger than or equal to some predefined $R$, we assume that this will also happen for the next time instant $n+1$, provided we work in a slowly changing environment. More than that, we expect $R^{(n+1)}_{\text{feas}} \geq R$ to hold for a margin $\rho_{n+1}$ slightly larger than $\rho_n$. Hence, at time $n$, if $R^{(n)}_{\text{feas}} \geq R$, we set $\rho_{n+1} > \rho_n$ under some rule to be discussed below. On the contrary, if $R^{(n)}_{\text{feas}} < R$, then we assume that if the margin parameter value is slightly decreased to $\rho_{n+1} < \rho_n$, it may be possible to have $R^{(n+1)}_{\text{feas}} \geq R$. For example, if we set $R := 1/2$, this scheme aims at keeping the number of feasibility cases larger than or equal to that of infeasibilities, while at the same time it tries to push the margin parameter to larger values for better classification at the test phase.
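The feasibility rate and the resulting decision on the direction of the margin change can be sketched as below. The function names are hypothetical, and the input is assumed to be the list of values $y_j g_n(x_j)$ for the processed indices; the size of the margin change is deliberately left out, since the rule itself is not part of this sketch.

```python
def feasibility_rate(signed_outputs, rho):
    """Fraction of processed indices whose soft margin loss is zero.

    signed_outputs: values y_j * g_n(x_j) for j in J_n (assumed nonempty,
    since n always belongs to J_n).
    """
    feasible = [v for v in signed_outputs if max(0.0, rho - v) == 0.0]
    return len(feasible) / len(signed_outputs)

def margin_should_increase(rate, threshold):
    """True if the next margin is to be set larger, False if smaller."""
    return rate >= threshold

# Illustrative values (assumed): two of the four samples achieve the margin.
rate = feasibility_rate([1.2, 0.3, 0.9, 1.0], rho=1.0)
increase = margin_should_increase(rate, threshold=0.5)
```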

Kernel affine projection algorithm
Here we introduce a byproduct of Algorithm 1, namely, a kernelized version of the standard affine projection algorithm [15,18,19].
Motivated by the discussion in Section 3, Algorithm 1 was devised in order to find at each time instant $n$ a point in the set of all desirable classifiers $\bigcap_{j \in J_n} \Pi^+_{j,n} \neq \emptyset$. Since any point in this intersection is suitable for the classification task at time $n$, any nonempty subset of $\bigcap_{j \in J_n} \Pi^+_{j,n}$ can be used for the problem at hand. In what follows we see that if we limit the set of desirable classifiers and deal with the boundaries $\{\Pi_{j,n}\}_{j \in J_n}$, that is, hyperplanes (Section 2.2), of the closed half-spaces $\{\Pi^+_{j,n}\}_{j \in J_n}$, we end up with a kernelized version of the classical affine projection algorithm [18,19].
Figure 2: For simplicity, we assume that at some time instant $n \in \mathbb{Z}_{\geq 0}$, the cardinality card($J_n$) = 2. This figure illustrates the closed half-spaces $\{\Pi^+_{j,n}\}_{j=1}^{2}$ and their boundaries, that is, the hyperplanes $\{\Pi_{j,n}\}_{j=1}^{2}$. In the case where $\bigcap_{j=1}^{2} \Pi_{j,n} \neq \emptyset$, the linear variety defined in (11) becomes $V_n = \bigcap_{j=1}^{2} \Pi_{j,n}$, which is a subset of $\bigcap_{j=1}^{2} \Pi^+_{j,n}$. The kernel APA aims at finding a point in the linear variety $V_n$, while Algorithm 1 and the APSM consider the more general setting of finding a point in $\bigcap_{j=1}^{2} \Pi^+_{j,n}$. Due to the range of the extrapolation parameter $\mu_n \in [0, 2\mathcal{M}_n]$ and $\mathcal{M}_n \geq 1$, the APSM can rapidly furnish solutions close to the large intersection of the closed half-spaces (see also Figure 6), without suffering from instabilities in the calculation of a Moore-Penrose pseudoinverse matrix necessary for finding the projection $P_{V_n}$.
Definition 1 (kernel affine projection algorithm). Fix $n \in \mathbb{Z}_{\geq 0}$ and let $q_n := \text{card}(J_n)$. Define the set of hyperplanes $\{\Pi_{j,n}\}_{j \in J_n}$ by
$$\Pi_{j,n} := \{u \in \mathcal{H} \times \mathbb{R} : \langle u, a_{j,n} \rangle = \rho_j^{(n)}\}, \quad \forall j \in J_n, \tag{9}$$
where $a_{j,n} := y_j(\kappa(x_j, \cdot), 1)$, for all $j \in J_n$. These hyperplanes are the boundaries of the closed half-spaces $\{\Pi^+_{j,n}\}_{j \in J_n}$ (see Figure 2). Note that such hyperplane constraints as in (9) are often met in regression problems with the difference that there the coefficients $\{\rho_j^{(n)}\}_{j \in J_n}$ are part of the given data and not parameters as in the present classification task.
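Since we will be looking for classifiers in the assumed nonempty intersection $\bigcap_{j \in J_n} \Pi_{j,n}$, we define the function
$$u \mapsto \sum_{j \in J_n} \big(\langle u, a_{j,n} \rangle - \rho_j^{(n)}\big)^2, \quad \forall u \in \mathcal{H} \times \mathbb{R}, \tag{10}$$
and let the set (see Figure 2)
$$V_n := \arg\min_{u \in \mathcal{H} \times \mathbb{R}} \sum_{j \in J_n} \big(\langle u, a_{j,n} \rangle - \rho_j^{(n)}\big)^2. \tag{11}$$
This set is a linear variety (for a proof see Appendix A).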
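Clearly, if $\bigcap_{j \in J_n} \Pi_{j,n} \neq \emptyset$, then $V_n = \bigcap_{j \in J_n} \Pi_{j,n}$. Now, given an arbitrary initial $u_0$, the kernel affine projection algorithm is defined by the following point sequence:
$$u_{n+1} := u_n + \mu_n (P_{V_n}(u_n) - u_n) = u_n + \mu_n (a_{1,n}, \ldots, a_{q_n,n})\, G_n^{\dagger} \big[\rho_1^{(n)} - \langle a_{1,n}, u_n \rangle, \ldots, \rho_{q_n}^{(n)} - \langle a_{q_n,n}, u_n \rangle\big]^t, \quad \forall n \in \mathbb{Z}_{\geq 0}, \tag{12}$$
where the extrapolation parameter $\mu_n \in [0, 2]$, $G_n$ is a matrix of dimension $q_n \times q_n$, where its $(i, j)$th element is defined by $y_i y_j (\kappa(x_i, x_j) + 1)$, for all $i, j \in \overline{1, q_n}$, the symbol $\dagger$ stands for the (Moore-Penrose) pseudoinverse operator [40], and the notation $(a_{1,n}, \ldots, a_{q_n,n})\lambda := \sum_{j=1}^{q_n} \lambda_j a_{j,n}$, for all $\lambda \in \mathbb{R}^{q_n}$. For the proof of the equality in (12), refer to Appendix A.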
Remark 1. The fact that the classical (linear kernel) APA [18,19] can be seen as a projection algorithm onto a sequence of linear varieties was also demonstrated in [26, Appendix B]. The proof in Appendix A extends the defining formula of the APA, and thus the proof given in [26, Appendix B], to infinite-dimensional Hilbert spaces. Extending [26], the APSM [12][13][14] devised a convexly constrained asymptotic minimization framework which contains APA, the NLMS, as well as a variety of recently developed projection-based algorithms [20-25, 27, 28].
By Definition 1 and Appendix A, at each time instant $n$, the kernel APA produces its estimate by projecting onto the linear variety $V_n$. In the special case where $q_n := 1$, that is, $J_n = \{n\}$, for all $n$, (12) gives the kernel NLMS [42]. Note also that in this case, the pseudoinverse is simplified to $G_n^{\dagger} = a_n/\|a_n\|^2$, for all $n$. Since $V_n$ is a closed convex set, the kernel APA can be included in the wide frame of the APSM (see also the remarks just after Lemma 3.3 or Example 4.3 in [14]). Under the APSM frame, more directions become available for the kernel APA, not only in terms of theoretical properties, but also in devising variations and extensions of the kernel APA by considering more general convex constraints than $V_n$ as in [26], and by incorporating a priori information about the model under study [14].
Note that in the case where $\bigcap_{j \in J_n} \Pi_{j,n} \neq \emptyset$, then $V_n = \bigcap_{j \in J_n} \Pi_{j,n}$. Since $\Pi_{j,n}$ is the boundary and thus a subset of the closed half-space $\Pi^+_{j,n}$, it is clear that looking for points in $\bigcap_{j \in J_n} \Pi_{j,n}$ in the kernel APA, and not in the larger $\bigcap_{j \in J_n} \Pi^+_{j,n}$ as in Algorithm 1, limits our view of the online classification task (see Figure 2). Under mild conditions, Algorithm 1 produces a point sequence that enjoys properties like monotone approximation, strong convergence to a point in the intersection $K \cap (\bigcap_{j \in J_n} \Pi^+_{j,n})$, asymptotic optimality, as well as a characterization of the limit point.
To speed up convergence, Algorithm 1 offers the extrapolation parameter $\mu_n$ which has a range of $\mu_n \in [0, 2\mathcal{M}_n]$ with $\mathcal{M}_n \geq 1$. The calculation of the upper bound $\mathcal{M}_n$ is given by simple operations that do not suffer from instabilities as in the computation of the (Moore-Penrose) pseudoinverses $(G_n^{\dagger})_n$ in (12) [40]. A usual practice for the efficient computation of the pseudoinverse matrix is to diagonally load some matrix with positive values prior to inversion, thus leading to solutions of an approximation of the original problem at hand [15,40].
The above-introduced kernel APA is based on the fundamental notion of metric projection mapping on linear varieties in a Hilbert space, and it can thus be straightforwardly extended to regression problems. In the sequel, we will focus on the more general view offered to classification by Algorithm 1 and not pursue further the kernel APA approach.

SPARSIFICATION BY A SEQUENCE OF FINITE-DIMENSIONAL SUBSPACES
In this section, sparsification is achieved by the construction of a sequence of linear subspaces $(M_n)_{n \in \mathbb{Z}_{\geq 0}}$, together with their bases $(\mathcal{B}_n)_{n \in \mathbb{Z}_{\geq 0}}$, in the space $\mathcal{H}$. The present approach is in line with the rationale presented in [36], where a monotonically increasing sequence of subspaces $(M_n)_{n \in \mathbb{Z}_{\geq 0}}$ was constructed, that is, $M_n \subseteq M_{n+1}$, for all $n \in \mathbb{Z}_{\geq 0}$. Such a monotonic increase of the subspaces' dimension undoubtedly raises memory resources issues. In this paper, such a monotonicity restriction is not followed.
To accommodate memory limitations and tracking requirements, two parameters, namely $L_b$ and $\alpha$, will be of central importance in our design. The parameter $L_b$ establishes a bound on the dimensions of $(M_n)_{n \in \mathbb{Z}_{\geq 0}}$, that is, if we define $L_n := \dim(M_n)$, then $L_n \leq L_b$, for all $n \in \mathbb{Z}_{\geq 0}$. Given a basis $\mathcal{B}_n$, a buffer is needed in order to keep track of the $L_n$ basis elements. The larger the dimension of the subspace $M_n$, the larger the buffer necessary for saving the basis elements. Here, $L_b$ gives the designer the freedom to preset an upper bound for the dimensions $(L_n)_n$, and thus upper-bound the size of the buffer according to the available computational resources. Note that this introduces a tradeoff between memory savings and representation accuracy; the larger the buffer, the more basis elements are used in the kernel expansion, and thus the larger the accuracy of the functional representation, or, in other words, the larger the span of the basis, which gives us more candidates for our classifier. We will see below that such a bound $L_b$ results in a sliding window effect. Note also that if the data $\{x_n\}_{n \in \mathbb{Z}_{\geq 0}}$ are drawn from a compact set in $\mathbb{R}^m$, then the algorithmic procedure introduced in [36] produces a sequence of monotonically increasing subspaces with dimensions upper-bounded by some bound not known a priori.
The parameter $\alpha$ is a measure of approximate linear dependency or independency. Every time a new element $\kappa(\mathbf{x}_{n+1}, \cdot)$ becomes available, we compare its distance from the available finite-dimensional linear subspace $M_n = \operatorname{span}(B_n)$ with $\alpha$, where span stands for the linear span operation. If the distance is larger than $\alpha$, then we say that $\kappa(\mathbf{x}_{n+1}, \cdot)$ is sufficiently linearly independent of the basis elements of $B_n$, we decide that it carries enough "new information," and we add this element to the basis, creating a new $B_{n+1}$ which clearly contains $B_n$. However, if the above distance is smaller than or equal to $\alpha$, then we say that $\kappa(\mathbf{x}_{n+1}, \cdot)$ is approximately linearly dependent on the elements of $B_n$, so that augmenting $B_n$ is not needed. In other words, $\alpha$ controls the frequency with which new elements enter the basis. Obviously, the larger the $\alpha$, the more "difficult" it is for a new element to contribute to the basis. Again, a tradeoff between the cardinality of the basis and the functional representation accuracy is introduced, as also seen above for the parameter $L_b$.
To increase the speed of convergence of the proposed algorithm, concurrent processing is introduced by means of the index set $J_n$, which indicates which closed half-spaces will be processed at the time instant $n$. Note once again that such processing is behind the increase of the convergence speed met in APA [18,19] when compared to that of the NLMS [16,17], in classical adaptive filtering [15]. Without any loss of generality, and in order to keep the discussion simple, we consider here the following simple case for $J_n$:

$$J_n := \begin{cases} \overline{0, n}, & \text{if } n < q - 1, \\ \overline{n - q + 1, n}, & \text{if } n \ge q - 1, \end{cases} \tag{13}$$

where $q \in \mathbb{Z}_{>0}$ is a predefined constant denoting the number of closed half-spaces to be processed at each time instant $n \ge q - 1$. In other words, for $n \ge q - 1$, at each time instant $n$, we consider concurrent projections on the closed half-spaces associated with the $q$ most recent samples. We state now a definition whose motivation is the geometrical framework of the oblique projection mapping given in Figure 1.
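Definition 2. Given $n \in \mathbb{Z}_{\ge 0}$, assume the finite-dimensional linear subspaces $M_n, M_{n+1} \subset \mathcal{H}$ with dimensions $L_n$ and $L_{n+1}$, respectively. Then it is well known that there exists a linear subspace $W_n$ such that $M_n + M_{n+1} = W_n \oplus M_{n+1}$, where the symbol $\oplus$ stands for the direct sum [40,41]. Then, the following mapping is defined:

$$\pi_n : M_n + M_{n+1} \to M_{n+1} : f \mapsto \pi_n(f) := \begin{cases} f, & \text{if } M_n \subseteq M_{n+1}, \\ P_{M_{n+1}, W_n}(f), & \text{if } M_n \not\subseteq M_{n+1}, \end{cases} \tag{14}$$

where $P_{M_{n+1}, W_n}$ denotes the oblique projection mapping on $M_{n+1}$ along $W_n$. To visualize this in the case when $M_n \not\subseteq M_{n+1}$, refer to Figure 1, where $M$ becomes $M_{n+1}$ and $M'$ becomes $W_n$.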
To exhibit the sparsification method, the constructive approach of mathematical induction on n ∈ Z ≥0 is used as follows.

At the time instant n ∈ Z >0
We assume, now, that at time $n \in \mathbb{Z}_{>0}$ the basis $B_n := \{\psi_1^{(n)}, \ldots, \psi_{L_n}^{(n)}\}$ is available, where $L_n \in \mathbb{Z}_{>0}$. Define also the linear subspace $M_n := \operatorname{span}(B_n)$, which is of dimension $L_n$.
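Without loss of generality, we assume that $n \ge q - 1$, so that the index set $J_n := \overline{n - q + 1, n}$ is available. Available are also the kernel functions $\{\kappa(\mathbf{x}_j, \cdot)\}_{j \in J_n}$. Our sparsification method is built on the sequence of closed linear subspaces $(M_n)_n$. At every time instant $n$, all the information needed for the realization of the sparsification method will be contained within $M_n$. As such, each $\kappa(\mathbf{x}_j, \cdot)$, for $j \in J_n$, must be associated with, or approximated by, a vector in $M_n$. Thus, we associate to each $\kappa(\mathbf{x}_j, \cdot)$, $j \in J_n$, a set of vectors $\{\theta^{(n)}_{\mathbf{x}_j}\}_{j \in J_n}$, as follows:

$$\kappa(\mathbf{x}_j, \cdot) \mapsto k^{(n)}_{\mathbf{x}_j} := \sum_{l=1}^{L_n} \theta^{(n)}_{\mathbf{x}_j, l}\, \psi^{(n)}_l \in M_n, \quad \forall j \in J_n, \tag{15}$$

where $\theta^{(n)}_{\mathbf{x}_j, l}$ denotes the $l$th component of the coefficient vector $\theta^{(n)}_{\mathbf{x}_j}$. For example, at time 0, $\kappa(\mathbf{x}_0, \cdot) \mapsto k^{(0)}_{\mathbf{x}_0} := \psi^{(0)}_1$. Since we follow the constructive approach of mathematical induction, the above set of vectors is assumed to be known.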
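Available is also the matrix $K_n \in \mathbb{R}^{L_n \times L_n}$, whose $(i,j)$th component is $(K_n)_{i,j} := \langle \psi_i^{(n)}, \psi_j^{(n)} \rangle$, for all $i, j \in \overline{1, L_n}$. It can be readily verified that $K_n$ is a Gram matrix which, by the assumption that $\{\psi_l^{(n)}\}_{l=1}^{L_n}$ are linearly independent, is also positive definite [40,41]. Hence, the existence of its inverse $K_n^{-1}$ is guaranteed. We assume here that $K_n^{-1}$ is also available.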

At time n + 1, the new data x n+1 becomes available
At time n + 1, a new element κ(x n+1 , •) of H becomes available.Since M n is a closed linear subspace of H , the orthogonal projection of κ(x n+1 , •) onto M n is well defined and given by where the vector given by [37,38] Since K −1 n was assumed available, we can compute Now, the distance d n+1 of κ(x n+1 , •) from M n (in Figure 1 this is the quantity f −P M ( f ) ) can be calculated as follows: In order to derive (19), we used the fact that the linear operator P Mn is selfadjoint and the linearity of the inner product l , for all l ∈ 1, L n .Moreover, M n+1 := span(B n+1 ) = M n .Also, we let K n+1 := K n , and K −1 n+1 := K −1 n .Notice here that J n+1 := n − q + 2, n + 1.The approximations given by (15) have to be transfered now to the new linear subspace M n+1 .To do so, we employ the mapping π n given in Definition 2: for all j ∈ J n+1 \ {n + 1}, k (n+1) . Since, M n+1 = M n , then by (14), As a result, θ (n+1) xn+1 , we use ( 16) and let k (n+1) xn+1 := P Mn (κ(x n+1 , •)).In other words, κ(x n+1 , •) is approximated by its orthogonal projection P Mn (κ(x n+1 , •)) onto M n , and this information is kept in memory by the coefficient vector θ (n+1)  xn+1 := ζ (n+1) xn+1 .
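The projection and distance computations described above can be illustrated with a short sketch. It assumes that every basis element is a kernel function $\kappa(\mathbf{x}_i, \cdot)$ centred at a stored data point, so that the Gram matrix and the vector of inner products reduce to kernel evaluations. The function names, the Gaussian kernel normalisation $\exp(-\|\mathbf{x}-\mathbf{y}\|^2/(2\sigma^2))$, and the direct linear solve (used here instead of a stored inverse $K_n^{-1}$) are illustrative assumptions, not the paper's implementation.

```python
import math

def gaussian_kernel(x, y, sigma2=0.5):
    """Gaussian (RBF) kernel; the 1/(2*sigma2) normalisation is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / M[i][i]
    return z

def project_onto_span(basis_points, x_new, sigma2=0.5):
    """Return (zeta, d): the coefficients of the orthogonal projection of
    kappa(x_new, .) onto span{kappa(x_i, .)} and its distance from that span."""
    K = [[gaussian_kernel(a, b, sigma2) for b in basis_points]
         for a in basis_points]                      # Gram matrix of the basis
    c = [gaussian_kernel(a, x_new, sigma2) for a in basis_points]
    zeta = solve(K, c)                               # normal equations K zeta = c
    d2 = gaussian_kernel(x_new, x_new, sigma2) - sum(
        ci * zi for ci, zi in zip(c, zeta))
    return zeta, math.sqrt(max(d2, 0.0))             # clip tiny negative round-off
```

The returned distance is what would be compared with the threshold $\alpha$; a point already in the basis yields a distance of (numerically) zero, while a point far from all stored points yields a distance close to $\sqrt{\kappa(\mathbf{x},\mathbf{x})} = 1$ for the Gaussian kernel.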

Approximate linear independency (d n+1 > α)
On the other hand, if d n+1 > α, then κ(x n+1 , •) becomes approximately linearly independent on B n , and we add it to our new basis.If we also have L n ≤ L b − 1, then we can increase the dimension of the basis without exceeding the memory of the buffer: L n+1 := L n + 1 and l , for all l ∈ 1, L n , and ψ (n+1)

Approximate linear independency ($d_{n+1} > \alpha$) and buffer overflow ($L_n + 1 > L_b$); the sliding window effect
Now, assume that $d_{n+1} > \alpha$ and that $L_n = L_b$. According to the above methodology, we still need to add $\kappa(\mathbf{x}_{n+1}, \cdot)$ to our new basis, but if we do so the cardinality $L_n + 1$ of this new basis will exceed our buffer's memory $L_b$. We choose here to discard the oldest element $\psi_1^{(n)}$ in order to make space for $\kappa(\mathbf{x}_{n+1}, \cdot)$: $L_{n+1} := L_b$, $\psi_l^{(n+1)} := \psi_{l+1}^{(n)}$, for all $l \in \overline{1, L_b - 1}$, and $\psi_{L_b}^{(n+1)} := \kappa(\mathbf{x}_{n+1}, \cdot)$. The removal of $\psi_1^{(n)}$ and the addition of $\kappa(\mathbf{x}_{n+1}, \cdot)$ result in the sliding window effect. We stress here that instead of discarding $\psi_1^{(n)}$, other elements of $B_n$ can be removed, if we use different criteria than the present ones. Here, we choose $\psi_1^{(n)}$ for simplicity, and for allowing the algorithm to focus on recent system changes by making its dependence on the remote past diminish as time moves on.
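The three cases of the basis update (approximate dependency, growth within the buffer, and overflow with removal of the oldest element) can be sketched as follows. This is a minimal illustration of the bookkeeping only: the basis is represented by the stored data points, the distance `d_new` is assumed to have been computed beforehand, and the accompanying updates of the inverse Gram matrix and of the coefficient vectors are omitted. The function name and return labels are hypothetical.

```python
from collections import deque

def update_basis(basis, x_new, d_new, alpha, L_b):
    """Update the buffer of basis points in place and report which case applied.

    basis : deque of stored data points, oldest first
    x_new : the newly arrived data point
    d_new : distance of kappa(x_new, .) from the span of the current basis
    alpha : threshold of approximate linear dependency
    L_b   : buffer length (upper bound on the basis cardinality)
    """
    if d_new <= alpha:
        # Approximately linearly dependent: the basis is left unchanged.
        return "dependent"
    if len(basis) < L_b:
        # Approximately independent and room left in the buffer: grow.
        basis.append(x_new)
        return "grown"
    # Approximately independent but buffer full: drop the oldest element
    # and append the new one (sliding window effect).
    basis.popleft()
    basis.append(x_new)
    return "slid"
```

For instance, with `alpha = 0.5` and `L_b = 3`, a buffer holding two points accepts a third point whose distance is 0.9, and a further point with the same distance then pushes out the oldest entry.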

Algorithm 2: Sparsification scheme by a sequence of finite-dimensional linear subspaces. (The listing is not recoverable from the source; the surviving fragments are a step that computes the vectors by (17) and (18), respectively, and the distance $d_{n+1}$ by (19), and the final step 25: increase $n$ by one, that is, $n \leftarrow n + 1$, and go to line 2.)

THE APSM WITH THE SUBSPACE-BASED SPARSIFICATION
In this section, we embed the sparsification strategy of Section 5 in the APSM. As a result, the following algorithmic procedure is obtained.

Algorithm 3. For any $n \in \mathbb{Z}_{\ge 0}$, consider the index set $J_n$ defined by (13). For any $j \in J_n$ and for any $n \in \mathbb{Z}_{\ge 0}$, let the closed half-space $\Pi^+_{j,n}$: […] For an arbitrary initial offset $b_0 \in \mathbb{R}$, consider as an initial classifier the point $u_0 := (0, b_0) \in \mathcal{H} \times \mathbb{R}$ and generate the following sequences by […] where $\pi_{-1}(f_0) := 0$, the vectors $\{\theta^{(n)}_{\mathbf{x}_j}\}_{j \in J_n}$, for all $n \in \mathbb{Z}_{\ge 0}$, are given by Algorithm 2, and where […] The function $g_n := g_{f_n, b_n}$, and $g$ is defined by (6). Moreover, $\rho_n$ is given by the procedure described in Section 4.1. Also, […] where […] The following proposition holds.
Proof. See Appendix C.

Now that we have a kernel series expression for the estimate $f_n$ by (26), we can give also an expression for the quantity $\pi_{n-1}(f_n)$ in (25b), by using also the definition (14): […] That is, whenever $M_{n-1} \not\subseteq M_n$, we remove from the kernel series expansion (26) the term corresponding to the basis element $\psi_1^{(n-1)}$. This is due to the sliding window effect and refers to the case of Section 5.3.3. According to our strategy, the case $M_{n-1} \not\subseteq M_n$ happens only when approximate linear independency $d_n > \alpha$ and a buffer overflow $L_{n-1} + 1 > L_b$ occur. To prevent this buffer overflow, we have to cut off the term corresponding to $\psi_1^{(n-1)}$, and keep an empty position in the buffer in order for the new element $\kappa(\mathbf{x}_n, \cdot)$ to contribute to the basis. Having the knowledge of (27), the coefficients […], for all $n \in \mathbb{Z}_{\ge 0}$, will be given by the following iterative formula: let $\gamma_1^{(0)} := 0$, and for all $n \in \mathbb{Z}_{\ge 0}$, […] Our proposed algorithm is summarized as shown in Algorithm 3.

Algorithm 3: Proposed algorithm. (Only parts of the listing survive in the source.)
1. Initialization. Let $B_0 := \{\kappa(\mathbf{x}_0, \cdot)\}$, $\theta^{(0)}_{\mathbf{x}_0} := 1$, $\gamma_1^{(0)} := 0$, $J_0 := \{0\}$, and choose for the initial offset $b_0$ any value in $\mathbb{R}$. Fix $\alpha \ge 0$ and $L_b \in \mathbb{Z}_{>0}$.
Assume the time instant […]
Now, the index set $J_n$ becomes $J_n := \overline{n - q + 1, n}$ by (13). We already know […] is given by (26) and (25c).
8. Increase $n$ by one, that is, $n \leftarrow n + 1$, and go to line 2.
Notice that the calculation of all the metric and oblique projections is of linear complexity with respect to the dimension $L_n$. The main computational load of the proposed algorithm comes from the calculation of the orthogonal projection onto the subspace $M_n$ by (18), which is of order $O(L_n^2)$, where $L_n$ is the dimension of $M_n$. Since, however, we have upper-bounded $L_n \le L_b$, for all $n \in \mathbb{Z}_{\ge 0}$, it follows that the computational load of our method is upper-bounded by $O(L_b^2)$.

Figure 4: Tracking performance for the channel in Figure 3 where the LTI system is set to $H_1$. To allow concurrent processing, we let $q := \operatorname{card}(J_n) := 4$, for all $n$. The variance of the Gaussian kernel takes the value of $\sigma^2 := 0.5$. The buffer length $L_b := 500$, and $\alpha := 0.5$. The average number of basis elements is 110.

NUMERICAL EXAMPLES
An adaptive equalization problem for the nonlinear channel depicted in Figure 3 is chosen to validate the proposed design. The same model was chosen also in [11,30]. The sparsification scheme of Section 5 was applied also to the stochastic gradient descent methods of NORMA and kernel perceptron [29].
The source signal $(s_n)_n$ is a sequence of numbers taking values from $\{\pm 1\}$ with equal probability. A linear time-invariant (LTI) [43] channel follows in order to produce the signal $(w_n)_n$. Available are two transfer functions for the LTI system: […], where $\theta_1 := 29.5^\circ$ and $\theta_2 := -35^\circ$. In such a way, we can test our design under a sudden system change. The transfer functions $H_l(z) := \sum_{i=0}^{2} h_{li} z^{-i}$, $z \in \mathbb{C}$, $l = 1, 2$, were chosen as above in order to simplify computations, since $\sum_{i=0}^{2} h_{li}^2 = 1$, $l = 1, 2$. This choice comes from [5, equation (28)]. The nonlinearity in Figure 3 is given by $p_n := w_n + 0.2 w_n^2 - 0.1 w_n^3$, for all $n$, as in [5, equation (29)]. Gaussian i.i.d. noise $(n_n)_n$, with zero mean and SNR = 10 dB with respect to $(p_n)_n$, is added to give the received signal $(x_n)_n$. As in [11,30], the data space is the Euclidean $\mathbb{R}^4$, and the data are formed as $\mathbf{x}_n := (x_n, x_{n-1}, x_{n-2}, x_{n-3})^t \in \mathbb{R}^4$, for all $n \in \mathbb{Z}_{\ge 0}$. The label $y_n$, at time instant $n$, is defined by the transmitted training symbol $s_{n-\tau}$, for all $n \in \mathbb{Z}_{\ge 0}$, where $\tau := 1$ [5]. The dimension of the data space and the parameter $\tau$ are the equalizer order and delay, respectively [5]. The Gaussian (RBF) kernel was used (cf. Section 2.1) in order to perform the classification task in an infinite-dimensional RKHS $\mathcal{H}$ [1][2][3].
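The data generation described above can be sketched as follows. The tap values in `PLACEHOLDER_TAPS` are an illustrative unit-energy three-tap filter and are not the transfer functions $H_1$ or $H_2$ of the experiments, whose coefficients are not reproduced here. Further assumptions of the sketch: the channel starts from a zero state, the noise level is set from the empirical mean square of $(p_n)_n$, and data vectors are formed only from $n = 3$ onwards so that all four received samples exist.

```python
import math
import random

# Illustrative unit-energy taps (an assumption, not the paper's H_1 or H_2).
PLACEHOLDER_TAPS = (0.3482, 0.8704, 0.3482)

def simulate_channel(num, taps, snr_db=10.0, tau=1, seed=0):
    """Generate (data, labels) for the nonlinear channel equalization task."""
    rng = random.Random(seed)
    # Source: equiprobable symbols from {-1, +1}.
    s = [rng.choice((-1.0, 1.0)) for _ in range(num)]
    # LTI channel: w_n = sum_i h_i * s_{n-i}, with zero initial state.
    w = [sum(h * s[n - i] for i, h in enumerate(taps) if n - i >= 0)
         for n in range(num)]
    # Memoryless nonlinearity: p_n = w_n + 0.2 w_n^2 - 0.1 w_n^3.
    p = [v + 0.2 * v ** 2 - 0.1 * v ** 3 for v in w]
    # Zero-mean Gaussian noise scaled to the requested SNR relative to p.
    power = sum(v * v for v in p) / num
    noise_std = math.sqrt(power / 10.0 ** (snr_db / 10.0))
    x = [v + rng.gauss(0.0, noise_std) for v in p]
    # Data vectors x_n = (x_n, x_{n-1}, x_{n-2}, x_{n-3}), labels y_n = s_{n-tau}.
    data, labels = [], []
    for n in range(max(3, tau), num):
        data.append((x[n], x[n - 1], x[n - 2], x[n - 3]))
        labels.append(s[n - tau])
    return data, labels
```

A channel switch at a given time instant could be imitated by generating two segments with different tap sets and concatenating them, at the cost of a short transient at the switching point.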
We compared the proposed methodology with the stochastic gradient descent method NORMA [29, Section III.A], which is a soft margin generalization of the classical kernel perceptron algorithm [29, Section VI.A]. The results are demonstrated in Figures 4, 5, 6, 7, and 8. The misclassification rate is defined as the ratio of the misclassifications (cf. Section 3) to the number of the test data, which are taken to be 100. A total of 100 experiments were performed and uniformly averaged to produce each curve in the figures.
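The evaluation protocol can be sketched in a few lines. The sketch assumes that a test point $(\mathbf{x}, y)$ counts as misclassified when $y\, g(\mathbf{x}) \le 0$ for the current classifier $g$; the precise definition is the one of Section 3. The function names are hypothetical.

```python
def misclassification_rate(classifier, test_data, test_labels):
    """Fraction of test points with y * g(x) <= 0 (assumed error criterion)."""
    errors = sum(1 for x, y in zip(test_data, test_labels)
                 if y * classifier(x) <= 0)
    return errors / len(test_labels)

def average_curves(curves):
    """Uniformly average equally long rate curves from independent runs."""
    runs = len(curves)
    return [sum(vals) / runs for vals in zip(*curves)]
```

Each run would record `misclassification_rate` on its 100 test points at every time instant, and `average_curves` would then combine the 100 runs into one plotted curve.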
In Figure 4, the transfer function of the LTI system in Figure 3 is set to $H_1(z)$, $z \in \mathbb{C}$. The variance $\sigma^2$ of the Gaussian kernel is set to $\sigma^2 := 0.5$. Recall here that the value of $L_b$ is closely related to the available computational resources of our system (refer to Section 5). Here we choose the value $L_b = 500$, which was set to coincide with the time instant at which a sudden system change occurs in Figures 7 and 8. The same buffer with length $L_b$ was also used for the NORMA and the kernel perceptron methods, with a learning rate of $\eta_n := 1/\sqrt{n}$, for all $n \in \mathbb{Z}_{>0}$, as suggested in [29]. The physical meaning of the parameter $\alpha$ is given in Section 5, where we have already seen that it defines a […]

Figure 8: A channel switch occurs at time $n = 500$, from $H_1$ to $H_2$, for the LTI system in Figure 3. The variance of the Gaussian kernel function is $\sigma^2 := 0.5$. The parameter $q = 16$. These curves correspond to different values of the pair $(\alpha, L_b)$, and more specifically, "APSM(b1)" corresponds to (0.9, 150), "APSM(b2)" to (0.75, 200), "APSM(b3)" to (0.5, 500), and "APSM(b4)" to (0.1, 1000).
Depending on the application, and the sparsity the designer wants to impose on the system, different ranges for α are expected (see [36] and Figure 8).The parameter ν NORMA which controls the soft margin adjustments of NORMA method is set to ν NORMA := 0.01, since it produced the best results after extensive experimentation.This value is also suggested in [29].The APSM with q = 1 (no concurrent processing) and the APSM with q = 4 are employed here.Both the simple and the concurrent APSMs use the extrapolation parameter μ n := 1, for all n ∈ Z ≥0 .For the parameters which control the margin (see Section 4.1), we let ρ 0 := 1, θ 0 := 1.This choice of ρ 0 and θ 0 provides for the initial value of 1 for the margin in Section 4.1, which is also a typical initial value in online [29] and SVM [1] settings.We have seen, by extensive experimentation, that the best results were produced for a slowly changing sequence (ρ n ) n .To guarantee such a behaviour, we assign small values to the step size δθ:= 10 −3 and to the slope ν APSM := 10 −1 .We also let the threshold for the feasibility rate of Section 4.1 be R := 1/2.It can be verified by Figure 4 that both of the APSMs, that is, the nonconcurrent (q = 1) and the concurrent (q = 4), show faster convergence than the stochastic gradient descent methods of NORMA and kernel perceptron.Moreover, the concurrent APSM (q = 4) exhibits also a lower misclassification error level but with a computational cost of q = 4 times the cost of NORMA and of the kernel perceptron methods.Notice that the extrapolation parameter μ n was set to the value 1, that is, we did not take advantage of the freedom of choosing μ n ∈ [0, 2 M n ] which necessitates, however, an additional computational complexity of order O(q 2 ) for the calculation of the parameter M n in (25e).The average number of the basis elements was found to be 110.
In Figure 5, we compare two different sparsification methods for the APSM: one presented in [30], that is, Algorithm 1, denoted by APSM(a), and the other presented in Section 5, denoted by APSM(b). The parameters for both methods were fixed in order to produce the same misclassification error level. For both realizations, the concurrent APSM used $q = 16$ for the index set $J_n$, $n \in \mathbb{Z}_{\ge 0}$. The variance of the Gaussian kernel is set to $\sigma^2 := 0.5$, the radius of the closed ball in (8a) to $\delta := 2$, the parameter $\alpha := 0.5$, and the buffer length $L_b := 500$. The buffer length $N_b$ associated with the sparsification method APSM(a) (see the comments below Algorithm 1) was set to $N_b := 500$. We notice that the concurrent APSM(b) converges faster than the APSM(a). This is achieved, however, with an additional cost of order $O(L_n^2)$ due to the operation (18). Although slower, the concurrent APSM(a) achieves the same misclassification error level as the concurrent APSM(b). Moreover, we do not notice such big differences between the nonconcurrent versions of the APSMs for the two types of sparsification.
To exploit the extrapolation parameter μ n and its range [0, 2 M n ], we conducted the experiment depicted in Figure 6.The cardinality of the index set J n was set to q := 8, and all the parameters regarding the APSMs, as well as the NORMA and the kernel perceptron method, are the same as in the previous figures, but the variance of the Gaussian kernel function was set to σ 2 := 0.2.The extrapolated version of the APSM uses a parameter value μ n := 1.9 M n , for all n ∈ Z ≥0 .We observe that extrapolation indeed speeds up convergence, with an increased cost of order O(q 2 ) due to the necessary calculation of M n in (25e).It is also worth mentioning that the NORMA performs poorly, even compared to the kernel perceptron method for this RKHS H .
To study the effect of the coefficient $\alpha$ together with the length $L_b$ of the buffer, we refer to Figures 7 and 8, where a sudden channel change occurs, from the $H_1$ LTI system to the $H_2$ one, at the time instant 500. The coefficient $\alpha$, in Figure 7, was set to 0, while we assume that the buffer length is infinite, that is, $L_b := \infty$. In both figures the variance of the Gaussian kernel is set to 0.5, and the parameter $q := 16$ for the concurrent APSMs, that is, for the cardinality of $J_n$, for all $n \ge 16$ (see (13)). It is clear that the concurrent processing offered by the APSM remains by far the most robust approach, since it achieves fast convergence as well as a low misclassification rate level. In Figure 8, we examine the performance of the proposed sparsification scheme for various values of $(\alpha, L_b)$ and only for the concurrent version of the APSM. First, we notice that the introduction of sparsification in Figure 8 raises the misclassification rate level when compared with the design of unlimited computational resources, that is, $(\alpha, L_b) := (0, \infty)$ of Figure 7. In Figure 8, the pair $(\alpha, L_b)$ takes various values, so that "APSM(b1)" is associated with the pair (0.9, 150), "APSM(b2)" with (0.75, 200), "APSM(b3)" with (0.5, 500), and "APSM(b4)" with (0.1, 1000). These values were chosen in order to produce the same misclassification rate level for all the curves. This experiment shows a way to choose the values of $(\alpha, L_b)$ whenever a constraint is imposed on the length $L_b$ of the buffer to be used. The more the buffer length is decreased, or in other words, the smaller the cardinality of the basis we want to build, the more the parameter $\alpha$ has to be increased to keep the same misclassification rate level, so that the new elements in the sequence $(\kappa(\mathbf{x}_n, \cdot))_n$ enter the basis less frequently.

CONCLUSIONS
This paper presents a sparsification method for the online classification task, based on a sequence of linear subspaces and combined with the convex analytic approach of the adaptive projected subgradient method (APSM). Limitations on memory and computational resources, which are inherent in online systems, are accommodated by imposing an upper bound on the dimension of the sequence of the subspaces. The design obtains a geometric perspective by means of projection mappings. To validate the design, an adaptive equalization problem for a nonlinear channel is considered, and the proposed method is compared not only with classical and recent stochastic gradient descent methods, but also with a sparsified version of the APSM with a norm constraint.

APPENDICES A. PROOF (I) OF V n IS A LINEAR VARIETY AND (II) OF (12)
Fix n ∈ Z ≥0 and define the mapping A : The mapping A is clearly linear and also bounded [37,38] since if we recall that the norm of A is A := sup u ≤1 A(u) , we can easily verify that for all u such that u ≤ 1.The adjoint operator A * : R qn → H × R of A is then linear and bounded [38, Theorem 6.5.1].
To find its expression, we know by definition that λ t A(u) = u, A * (λ) , for all u ∈ H × R, for all λ ∈ R qn .Now, by simple algebraic manipulations, we obtain that λ j a j,n =: a 1,n , . . ., a qn,n λ. (A.4) The mapping AA * is given clearly by AA * (λ) = a1,n,A * (λ) ... aq n,n ,A * (λ) , for all λ ∈ R qn .Moreover, one can easily verify that for all i ∈ 1, q n , a i,n , A * (λ) = a i,n , qn j=1 λ j a j,n = qn j=1 λ j a i,n , a j,n , (A.5) so that we have AA * (λ) = G n λ, for all λ ∈ R qn , where the (i, j)th element of G n is defined as a i,n , a j,n H×R , for all i, j ∈ 1, q n .Since a j,n was defined as a j,n := y j (κ(x j , •), 1), it can be easily seen by the inner product in H × R that a i,n , a j,n H×R = y i y j κ(x i , x j ) + y i y j , for all i, j ∈ 1, q n .As a result, AA * = G n .Now, by A the set V n obtains an alternative expression; V n = arg min u∈H×R ρ (n) − A(u) , where ρ (n) By this new expression of V n , we see by [38, Theorem 6.9.1] that V n is the set of all those elements that satisfy the equations Clearly, V n is also a linear variety.By the linearity of A * , we obtain By the definition of the pseudoinverse operator [38, Section 6.11], the unique element of V n with the smallest norm is given by u * := A † (e n (u n )), where A † is the pseudoinverse operator of A [38].Thus, and by the uniqueness of P Vn (u n ), we obtain , which completes the proof of (12).
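The entries of $G_n$ derived above, $\langle a_{i,n}, a_{j,n} \rangle_{\mathcal{H}\times\mathbb{R}} = y_i y_j \kappa(\mathbf{x}_i, \mathbf{x}_j) + y_i y_j$, can be assembled directly from kernel evaluations. The following sketch assumes a Gaussian kernel with the normalisation $\exp(-\|\mathbf{x}-\mathbf{y}\|^2/(2\sigma^2))$; the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma2=0.5):
    """Gaussian (RBF) kernel; the 1/(2*sigma2) normalisation is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))

def gram_G(points, labels, sigma2=0.5):
    """Matrix with (i, j) entry y_i * y_j * (kappa(x_i, x_j) + 1)."""
    return [[yi * yj * (gaussian_kernel(xi, xj, sigma2) + 1.0)
             for xj, yj in zip(points, labels)]
            for xi, yi in zip(points, labels)]
```

With labels in $\{\pm 1\}$ and a Gaussian kernel, every diagonal entry equals 2, and the matrix is symmetric by construction.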

B. PROOF OF (23)
Since $K_{n+1} K_{n+1}^{-1} = I_{L_{n+1}}$, by multiplying (21) with (22) we obtain the following two equations: […] where $I_m$ stands for the identity matrix of dimension $m \in \mathbb{Z}_{>0}$. Notice that since both $K_{n+1}$ and $K_{n+1}^{-1}$ are positive definite, we obtain that $s_{n+1} > 0$ and that $H_{n+1}$ is positive definite [41].
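The effect on the inverse Gram matrix of appending one element to the basis can be illustrated with the standard block-inverse (Schur complement) identity. This is a generic sketch of the case in which the basis grows by one element; the paper's own update (23) may be organized differently in terms of its quantities $s_{n+1}$ and $H_{n+1}$. The function and argument names are hypothetical.

```python
def grow_inverse(K_inv, c, kappa_new):
    """Inverse of [[K, c], [c^T, kappa_new]] given K_inv = K^{-1}.

    K_inv     : inverse of the current Gram matrix (list of lists)
    c         : inner products of the new element with the current basis
    kappa_new : squared norm of the new element, kappa(x_new, x_new)
    """
    m = len(c)
    # zeta = K^{-1} c: projection coefficients of the new element.
    zeta = [sum(K_inv[i][j] * c[j] for j in range(m)) for i in range(m)]
    # r is the Schur complement, equal to the squared distance from the span.
    r = kappa_new - sum(ci * zi for ci, zi in zip(c, zeta))
    if r <= 0.0:
        raise ValueError("new element is linearly dependent on the basis")
    new = [[K_inv[i][j] + zeta[i] * zeta[j] / r for j in range(m)]
           + [-zeta[i] / r] for i in range(m)]
    new.append([-z / r for z in zeta] + [1.0 / r])
    return new
```

The update costs $O(m^2)$ operations for a basis of size $m$, which is in line with the quadratic complexity bound quoted for the method, and the positivity of `r` is guaranteed whenever the new element passes the test $d_{n+1} > \alpha \ge 0$.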

C. PROOF OF PROPOSITION 2
We will prove Proposition 2 by mathematical induction on n ∈ Z ≥0 .Since by definition f 0 := 0, we have By the definition of the mapping π n in ( 14), we see that π n−1 ( f n ) ∈ M n , which means that there exists a set of real numbers {η to establish the relation given in Proposition 2. Since This completes the proof of Proposition 2.

MAIN NOTATIONS
$\mathcal{H}$, $\langle \cdot, \cdot \rangle$, and $\|\cdot\|$: The reproducing kernel Hilbert space (RKHS), its inner product, and its norm
$f$: An element of $\mathcal{H}$
$\kappa(\cdot, \cdot)$: The kernel function
$(\mathbf{x}_n, y_n)_{n\in\mathbb{Z}_{\ge 0}}$: Sequence of data and labels
$P_C$: Metric projection mapping onto the closed convex set $C$
$P_{M,M'}$: Oblique projection on the subspace $M$ along the subspace $M'$
$g_{f,b}$: The classifier given by means of $f \in \mathcal{H}$ and the offset $b$
$\overline{j_1, j_2} := \{j_1, j_1 + 1, \ldots, j_2\}$: An index set of consecutive integers
$J_n$: The index set which shows which closed half-spaces are concurrently processed at each time instant $n$
$\Pi^+_{j,n}$: The closed half-spaces to be concurrently processed
$(\mathbf{x}_j, y_j, \rho_j^{(n)})$: The triplet of data, labels, and margin parameters that define $\Pi^+_{j,n}$
$\mu_n$: Extrapolation parameters of Algorithm 1 and Algorithm 3, with ranges $[0, 2\mathcal{M}_n]$, where the bound $\mathcal{M}_n$ is given by (8e) and (25e), respectively
$\nu_{\mathrm{APSM}}$, $\theta_0$, $\delta\theta$, $\rho_0$: Parameters that control the margins in Section 4.1
$M_n$, $B_n$, and $L_n$: A subspace, its basis, and its dimension, used for sparsification
$\{\psi_l^{(n)}\}_{l=1}^{L_n}$: The basis elements of the basis $B_n$
$\pi_n$: The mapping defined by (14)
$k^{(n)}_{\mathbf{x}_j}$ and $\theta^{(n)}_{\mathbf{x}_j}$: An element of $M_n$ and its coefficient vector, which approximate the point $\kappa(\mathbf{x}_j, \cdot)$ by (15)
$K_n$: The Gram matrix formed by the elements of the basis $B_n$
$\zeta^{(n+1)}_{\mathbf{x}_{n+1}}$ and $c^{(n+1)}_{\mathbf{x}_{n+1}}$: The coefficient vector of the projection $P_{M_n}(\kappa(\mathbf{x}_{n+1}, \cdot))$ onto $M_n$ and the coefficient vector in the normal equations of (18)
$d_{n+1}$: The distance of $\kappa(\mathbf{x}_{n+1}, \cdot)$ […]


Figure 1: An illustration of the metric projection mapping $P_C$ onto the closed convex subset $C$ of $\mathcal{H}$, the projection $P_{B[f_0,\delta]}$ onto the closed ball $B[f_0, \delta]$, the orthogonal projection $P_M$ onto the closed linear subspace $M$, and the oblique projection $P_{M,M'}$ on $M$ along the closed linear subspace $M'$.

Figure 3: The model of the nonlinear channel for which adaptive equalization is needed.

Figure 5: Tracking performance for the channel in Figure 3 when the LTI system is $H_1$. We let $\operatorname{card}(J_n) := 16$, for all $n$. The variance of the Gaussian kernel takes the value of $\sigma^2 := 0.5$. The APSM(a) refers to Algorithm 1 while APSM(b) refers to Algorithm 3. The radius of the closed ball is set to $\delta := 2$. The buffer length $L_b := 500$, and $\alpha := 0.5$.

Figure 7: A channel switch occurs at time $n = 500$, from $H_1$ to $H_2$, for the LTI system in Figure 3. No sparsification for the APSMs, and no regularization for NORMA, is considered here. The variance of the Gaussian kernel function is kept at the value $\sigma^2 := 0.5$.
. Hence, H −1 n+1 exists.If we multiply (B.1) on , •) from M n defined in (19) α and L b : The threshold of approximate linear dependency/independency and the length of the buffer (upper bound for L n ) used for the kernel expansion in (26) r n+1 , h n+1 , H n+1 , and s n+1 , p n+1 , P n+1 :
