High-dimensional neural feature design for layer-wise reduction of training cost

We design a rectified linear unit-based multilayer neural network by mapping the feature vectors to a higher dimensional space in every layer. We design the weight matrices in every layer to ensure a reduction of the training cost as the number of layers increases. Linear projection to the target in the higher dimensional space leads to a lower training cost if a convex cost is minimized. An ℓ2-norm convex constraint is used in the minimization to reduce the generalization error and avoid overfitting. The regularization hyperparameters of the network are derived analytically to guarantee a monotonic decrement of the training cost, thereby eliminating the need for cross-validation to find the regularization hyperparameter in each layer. We show that the proposed architecture is norm-preserving and provides an invertible feature vector, and therefore can be used to reduce the training cost of any other learning method that employs linear projection to estimate the target.


Introduction
Nonlinear mapping of low-dimensional signals to a high-dimensional space is a traditional method for constructing useful feature vectors, specifically for classification problems. The intuition is that, by extending to a high dimension, the feature vectors of different classes become easily separable by a linear classifier. The drawback of performing classification in a higher dimensional space is the increased computational complexity. This issue can be handled by the well-known "kernel trick," in which the complexity depends only on the inner products in the high-dimensional space. The support vector machine (SVM) [1] and kernel PCA (KPCA) [2] are examples of creating high-dimensional features by employing the kernel trick. The choice of the kernel function is a critical aspect that can affect the classification performance in the higher dimensional space. A popular kernel is the radial basis function (RBF) kernel, or Gaussian kernel, and its good performance is justified by its ability to map the feature vector to a very high (infinite) dimensional space [3]. In this manuscript, we design a high-dimensional feature using an artificial neural network (ANN) architecture to achieve better classification performance by increasing the number of layers. The architecture uses the rectified linear unit (ReLU) activation, predetermined orthonormal matrices, and a fixed structured matrix. We refer to this as the high-dimensional neural feature (HNF) throughout the manuscript.
Neural networks and deep learning architectures have received overwhelming attention over the last decade [4]. Appropriately trained neural networks have been shown to outperform traditional methods in different applications, for example, in classification and regression tasks [5, 6]. With continually increasing computational power, the field of machine learning is being enriched with active research pushing classification performance to higher levels for several challenging datasets [7-9]. However, very little is known regarding how many neurons and layers are required in a network to achieve better performance. Usually, some rule-of-thumb methods are used for determining the number of neurons and layers in an ANN, or an exhaustive search is employed, which is extremely time-consuming [10]. In particular, the technical issue of guaranteeing performance improvement as the number of layers increases is not straightforward in traditional neural network architectures, e.g., the deep neural network (DNN) [11], convolutional neural network (CNN) [12], and recurrent neural network (RNN) [13]. We endeavor to address this technical issue by mapping the feature vectors to a higher dimensional space using predefined weight matrices.
There exist several works employing predefined weight matrices that do not need to be learned. The scattering convolution network [14] is a famous example of these approaches, which employs a wavelet-based scattering transform to design the weight matrices. Random matrices have also been widely used as a means of reducing the computational complexity of neural networks while achieving performance comparable to fully learned networks [15-18]. In the case of the simple, yet effective, extreme learning machine (ELM), the first layer of the network is assigned randomly chosen weights and the learning takes place only at the end layer [19-22]. It has also been shown recently that a performance similar to fully learned networks may be achieved by training a network with most of the weights assigned randomly and only a small fraction of them being updated throughout the layers [23]. It has been shown that networks with Gaussian random weights provide a distance-preserving embedding of the input data [16]. The recent work [18] designs a deep neural network architecture called the progressive learning network (PLN), which guarantees the reduction of the training cost as the number of layers increases. In PLN, every layer is comprised of a predefined random part and a projection part which is trained individually using a convex cost function. These approaches indicate that randomness has much potential in terms of high performance at low computational complexity. We design a multilayer neural network using predefined orthonormal matrices, e.g., random orthonormal matrices and the DCT matrix, to ensure a reduction of the training cost as the number of layers increases.

Our contributions
Motivated by the prior use of fixed matrices, we design the HNF architecture using an appropriate combination of ReLU, random matrices, and fixed matrices. We use predefined weight matrices in every layer of the network, and therefore, the architecture does not suffer from the infamous vanishing gradient problem. We theoretically show that the output of each layer provides a richer representation compared to the previous layers if a convex cost is minimized to estimate the target. We use an ℓ2-norm convex constraint to reduce the generalization error and avoid overfitting to the training data. We analytically derive the regularization hyperparameter that ensures the decrement of the training cost in each layer. Therefore, there is no need for cross-validation to find the optimum regularization hyperparameters of the network. We show that the proposed HNF is norm-preserving and invertible and, therefore, can be used to improve the performance of other learning methods that use linear projection to estimate the target. Finally, we show the classification performance of the proposed HNF against ELM and state-of-the-art results. Note that a preliminary version of this manuscript has been submitted to ICASSP 2020 recently.

Notations
We use the following notations unless otherwise noted: We use bold capital letters, e.g., W, to denote matrices and bold lowercase letters, e.g., x, to denote vectors. We use the calligraphic letter M to denote a set and M^c to denote its complement set. The cardinality of a set M is denoted by |M|. For a scalar x ∈ R, let us denote its sign and magnitude as s(x) ∈ {−1, +1} and |x|, respectively, and write x = s(x)|x|. For a vector x, we define the sign vector s(x) and magnitude vector |x| by the element-wise operation. We define g(·) as a nonlinear function comprised of a stack of element-wise ReLU activation functions. A vector x has non-negative part x⁺ and non-positive part x⁻ such that x = x⁺ + x⁻ and g(x) = x⁺. We use ‖·‖ and ‖·‖_F to denote the ℓ2-norm and the Frobenius norm, respectively. For example, it can be seen that ‖x‖² = ‖x⁺‖² + ‖x⁻‖², since x⁺ and x⁻ have disjoint supports.
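As a small numerical illustration of this notation (our own NumPy sketch, not part of the original text), the decomposition x = x⁺ + x⁻ and the norm identity can be checked directly:

```python
import numpy as np

def relu(x):
    # g(.): element-wise ReLU, keeps the non-negative part
    return np.maximum(x, 0.0)

def pos_part(x):
    # x+ : non-negative part of x
    return np.maximum(x, 0.0)

def neg_part(x):
    # x- : non-positive part of x
    return np.minimum(x, 0.0)

x = np.array([1.5, -2.0, 0.0, 3.0])
assert np.allclose(pos_part(x) + neg_part(x), x)      # x = x+ + x-
assert np.allclose(relu(x), pos_part(x))              # g(x) = x+
# disjoint supports give ||x||^2 = ||x+||^2 + ||x-||^2
assert np.isclose(np.linalg.norm(x) ** 2,
                  np.linalg.norm(pos_part(x)) ** 2 + np.linalg.norm(neg_part(x)) ** 2)
```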

Proposed method
In this section, we illustrate the motivation to design a high-dimensional feature vector by using the ReLU activation function. We analyze the behavior of a single layer ReLU network under input perturbation noise and show that by mapping the feature vectors to a higher dimension, we can increase the discrimination power of the ReLU network.
For an ANN, we wish to have noise robustness and discriminative power. We characterize this in the following definition.

Definition 1 (Noise robustness and point discrimination) Let x1 and x2 be two input vectors such that x1 ≠ x2, and let the outputs of the ANN be t̂1 = f(x1) and t̂2 = f(x2). We can characterize a perturbation scenario with the perturbation noise Δx as x2 = x1 + Δx. We wish the proposed ANN to hold the property

c ‖x1 − x2‖² ≤ ‖t̂1 − t̂2‖² ≤ C ‖x1 − x2‖²

for some constants 0 < c ≤ C. Note that the lower bound provides point discrimination power and the upper bound provides noise robustness to the input.

Layer construction
We first concentrate on one block of an ANN; this is called a layer in the neural network literature. The layer has an input vector q ∈ R^{m×1} and the output vector y = g(Wq). The dimension of y is the number of neurons in the layer. If we can guarantee that a layer of the ANN provides the noise robustness and point discrimination property, then the full ANN comprising multiple layers connected sequentially can be guaranteed to hold the robustness and discriminative properties. We need to construct W in such a manner that the layer has noise robustness and discriminative power according to Definition 1.

ReLU activation and a limitation
We first show three essential properties of the ReLU function, required to develop our main results. We then discuss one possible limitation of the ReLU function and propose a remedy to circumvent the problem.
Property 3 (Noise robustness) Let us consider z = Wq. For two vectors q1 and q2, we define the corresponding vectors z1 = Wq1 and z2 = Wq2, and output vectors y1 = g(z1) = g(Wq1) and y2 = g(z2) = g(Wq2). Now, we have the following relation:

0 ≤ ‖y1 − y2‖² ≤ ‖z1 − z2‖².

The proof of Property 3 is shown in Appendix 1. The upper bound shows Lipschitz continuity, which provides noise robustness. On the other hand, the lower bound being zero cannot maintain a minimum distance between two points y1 and y2. An example of the extreme effect is that when z1 and z2 are non-positive vectors, we get ‖y1 − y2‖² = 0.
This may limit the capacity of the ReLU function for achieving good discriminative power. The reason for the "lower bound being zero" limitation lies in the structure of the input matrix W. We build an appropriate structure for the input matrix to circumvent this limitation.
We now engineer a remedy for this limitation. Let us consider ȳ = g(Vz) = g(VWq), where z = Wq ∈ R^n and V is a linear transform matrix. For two vectors q1 and q2, we have the corresponding vectors z1 = Wq1 and z2 = Wq2, and output vectors ȳ1 = g(Vz1) and ȳ2 = g(Vz2). Our interest is to show that there exists a predefined matrix V for which we have both the noise robustness and discriminative power properties.

Proposition 1 Let us construct the V matrix as follows:

V_n = [I_n; −I_n] ∈ R^{2n×n},  (3)

i.e., the n × n identity matrix I_n stacked on top of −I_n. For the output vectors ȳ1 = g(V_n z1) ∈ R^{2n} and ȳ2 = g(V_n z2) ∈ R^{2n}, we have ‖ȳ1‖² = ‖z1‖² and

(1/2) ‖z1 − z2‖² ≤ ‖ȳ1 − ȳ2‖² ≤ ‖z1 − z2‖².  (4)

The proof of the above proposition can be found in Appendix 2. Based on the above proposition, we can interpret the effect of noise passing through such a layer. Let z2 = z1 + Δz, where Δz is a small perturbation noise. Note that Δz = z2 − z1 = W[q2 − q1] = WΔq. To investigate the effect of perturbation noise, we now state our main assumption.
Assumption 1 Given a small ‖Δz‖², the sign patterns of z1 and z2 do not differ significantly. On the other hand, for a large perturbation noise ‖Δz‖², the sign patterns of z1 and z2 vary significantly.
The above assumption means that for a small ‖Δz‖², the set M = {i : s(z_{1,i}) ≠ s(z_{2,i})} of indices where the sign patterns differ is close to an empty set. On the other hand, for a large ‖Δz‖², the set M is close to a full set. Considering Assumption 1, we can present the following remark regarding the effect of noise in the layer.
Remark 1 (Effect of perturbation noise) For a small perturbation noise ‖Δz‖², we have ‖ȳ1 − ȳ2‖² ≈ ‖z1 − z2‖². On the other hand, a large perturbation noise is attenuated.
This follows from the proof of Proposition 1, specifically Eqs. (28) and (29a). In fact, if the sign patterns of z1 and z2 agree in every coordinate, then ‖ȳ1 − ȳ2‖² = ‖z1 − z2‖². We interpret that a small perturbation noise passes through the single layer g(Vz) almost unattenuated. Let us construct an illustrative example. Assume that z2 = −z1, so that the sign patterns differ in every coordinate; then ‖ȳ1 − ȳ2‖² = ‖z1‖² + ‖z2‖² = (1/2)‖z1 − z2‖², and we can comment that the perturbation noise is attenuated.
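The construction (3) and the bound (4) can be checked numerically. The following sketch, written by us under the assumption V_n = [I_n; −I_n], verifies norm preservation and the two-sided bound on random draws:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def V(n):
    # V_n = [I_n; -I_n]: identity stacked on its negation, as in Eq. (3)
    return np.vstack([np.eye(n), -np.eye(n)])

n = 8
Vn = V(n)
for _ in range(200):
    z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
    y1, y2 = relu(Vn @ z1), relu(Vn @ z2)
    dz = np.sum((z1 - z2) ** 2)
    dy = np.sum((y1 - y2) ** 2)
    assert np.isclose(np.sum(y1 ** 2), np.sum(z1 ** 2))   # norm preservation
    assert 0.5 * dz - 1e-9 <= dy <= dz + 1e-9             # Eq. (4) bounds
```

The assertions never fail: coordinates where z1 and z2 share a sign contribute exactly (z_{1,i} − z_{2,i})² to the output distance, while coordinates with opposite signs contribute between half and all of it.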

High-dimensional neural feature
In this section, we employ the proposed weight matrix in (3) to construct a multilayer ANN. We show that by designing the weight matrices in every layer, it is possible to construct a network that provides noise robustness and point discrimination according to Definition 1.
Let us establish the relation between the input vector q ∈ R^m and the output vector ȳ ∈ R^{2n}. For two vectors q1 and q2, we have the corresponding vectors z1 = Wq1 and z2 = Wq2, and output vectors ȳ1 = g(V_n z1) = g(V_n Wq1) and ȳ2 = g(V_n z2) = g(V_n Wq2). Our interest is to show that it is possible to construct a matrix W ∈ R^{n×m} for which we have both the noise robustness and discriminative power properties. We can construct W ∈ R^{n×m} as an orthonormal matrix, such that n ≥ m and WᵀW = I_m. In that case, we have ‖q1 − q2‖² = ‖z1 − z2‖² for any pair (q1, q2). By combining this relation with Eq. (4), we conclude the following proposition.
Proposition 2 Consider the single layer network ȳ = g(V_n z) = g(V_n Wq), where W ∈ R^{n×m} is an orthonormal matrix, such that n ≥ m and WᵀW = I_m. Then, ‖ȳ‖² = ‖q‖², and for every two vectors q1 and q2, the following inequality holds:

(1/2) ‖q1 − q2‖² ≤ ‖ȳ1 − ȳ2‖² ≤ ‖q1 − q2‖².

The above proposition shows that by designing the weight matrix in a single layer network, it is possible to provide point discrimination and noise robustness according to Definition 1. Note that the weight matrix W can be any orthonormal matrix, such as an instance of a random orthonormal matrix or the DCT matrix. By considering the relation Δz = WΔq, we can present a similar argument as in Remark 1. We interpret that a small perturbation noise ‖Δq‖² passes through the single layer g(VWq) almost unattenuated. This is stated in the following remark.
Remark 2 (Effect of perturbation noise) For a small ‖Δq‖², we have ‖ȳ1 − ȳ2‖² ≈ ‖q1 − q2‖². On the other hand, a large perturbation noise is attenuated.
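Proposition 2 lends itself to the same kind of numerical check. A sketch of our own, using a QR factorization to obtain an orthonormal W (one of several valid choices):

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

m, n = 5, 8                                         # n >= m
W, _ = np.linalg.qr(rng.standard_normal((n, m)))    # orthonormal columns: W^T W = I_m
Vn = np.vstack([np.eye(n), -np.eye(n)])

def layer(q):
    # single-layer map: ybar = g(V_n W q)
    return relu(Vn @ (W @ q))

for _ in range(200):
    q1, q2 = rng.standard_normal(m), rng.standard_normal(m)
    d_in = np.sum((q1 - q2) ** 2)
    d_out = np.sum((layer(q1) - layer(q2)) ** 2)
    assert np.isclose(np.sum(layer(q1) ** 2), np.sum(q1 ** 2))   # ||ybar|| = ||q||
    assert 0.5 * d_in - 1e-9 <= d_out <= d_in + 1e-9             # Proposition 2
```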
By directly using Proposition 1, we can present a similar bound in regard to the perturbation of the weight matrix W in a single layer construction. We can show that the perturbation norm in the output due to the perturbation of the weight matrix has an upper bound that is a scaled version of the input norm. The scaling parameter ‖ΔW‖²_F is small for a small perturbation. The following remark illustrates this point in detail.
Remark 3 (Sensitivity to the weight matrix) Let the weight matrix W be perturbed by ΔW. The effective weight matrix is W + ΔW. For an input q and the respective outputs ȳ = g(V_n Wq) and ŷ = g(V_n [W + ΔW] q), we have

‖ȳ − ŷ‖² ≤ ‖ΔW‖²_F ‖q‖².

The proof can be found in Appendix 3.

Multilayer construction
A feedforward ANN is comprised of similar operational layers in a chain. Let us consider two layers in feedforward connection, e.g., the l-th and (l+1)-th layers of an ANN. For the l-th layer, we use a superscript (l) to denote the appropriate variables and parameters. Let the l-th layer have m^(l) nodes. The input to the l-th layer is q^(l) = ȳ^(l−1). The output of the l-th layer, ȳ^(l) = g(V_{n^(l)} z^(l)) = g(V_{n^(l)} W^(l) q^(l)), is next used as the input to the (l+1)-th layer, which means ȳ^(l) = q^(l+1). Thus, the output of the (l+1)-th layer is ȳ^(l+1) = g(V_{n^(l+1)} W^(l+1) ȳ^(l)). Now, for the two vectors q1^(l) and q2^(l), we have the following relation in the l-th layer based on Proposition 2:

(1/2) ‖q1^(l) − q2^(l)‖² ≤ ‖ȳ1^(l) − ȳ2^(l)‖² ≤ ‖q1^(l) − q2^(l)‖².

We present the above results as the following theorem to provide the noise robustness and discrimination power properties of the proposed ANN, which we call the high-dimensional neural feature (HNF) afterwards.

Theorem 1 The proposed HNF uses ReLU activation function and is constructed as follows:
(a) The HNF is comprised of L layers, where the l-th layer has the corresponding structure ȳ^(l) = g(V_{n^(l)} W^(l) ȳ^(l−1)). The L layers are in a chain. The input to the first layer is q^(1) = x. The output of the HNF is

ȳ^(L) = g(V_{n^(L)} W^(L) g(V_{n^(L−1)} W^(L−1) · · · g(V_{n^(1)} W^(1) x))).

(b) Each weight matrix W^(l) ∈ R^{n^(l)×m^(l)} is orthonormal, such that n^(l) ≥ m^(l) and [W^(l)]ᵀ W^(l) = I_{m^(l)}.
Then, ‖ȳ^(L)‖² = ‖x‖², and the constructed HNF provides noise robustness and discriminative power properties that are characterized by the following relation:

(1/2)^L ‖x1 − x2‖² ≤ ‖ȳ1^(L) − ȳ2^(L)‖² ≤ ‖x1 − x2‖²,

where x1 ∈ R^{m^(1)} and x2 ∈ R^{m^(1)} are two input vectors to the HNF and their corresponding outputs are ȳ1^(L) and ȳ2^(L), respectively.
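A minimal sketch of Theorem 1's construction, assuming n^(l) = m^(l) (square orthonormal W^(l)) in every layer, checks the norm preservation and the per-layer compounding of the lower bound:

```python
import numpy as np

rng = np.random.default_rng(2)

def build_hnf(m, L, rng):
    # one fixed matrix V_n W per layer, with n^(l) = m^(l) (square orthonormal W)
    mats, dim = [], m
    for _ in range(L):
        W, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        Vn = np.vstack([np.eye(dim), -np.eye(dim)])
        mats.append(Vn @ W)
        dim *= 2                       # output dimension doubles per layer
    return mats

def forward(mats, x):
    y = x
    for A in mats:
        y = np.maximum(A @ y, 0.0)     # ybar^(l) = g(V W ybar^(l-1))
    return y

m, L = 4, 3
mats = build_hnf(m, L, rng)
x1, x2 = rng.standard_normal(m), rng.standard_normal(m)
y1, y2 = forward(mats, x1), forward(mats, x2)
d_in, d_out = np.sum((x1 - x2) ** 2), np.sum((y1 - y2) ** 2)
assert y1.shape[0] == m * 2 ** L                            # dimension grows as 2^L
assert np.isclose(np.sum(y1 ** 2), np.sum(x1 ** 2))         # ||ybar^(L)|| = ||x||
assert 0.5 ** L * d_in - 1e-9 <= d_out <= d_in + 1e-9       # Theorem 1 bounds
```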
Note that a similar argument as in Remark 2 holds here as well. We interpret that a small perturbation noise passes through the multilayer structure almost unattenuated. On the other hand, a large perturbation noise is attenuated in every layer. Using Theorem 1, we follow similar arguments as in Remark 3 in regard to the perturbation of the weight matrices W^(l) in every layer of the HNF.
Remark 4 (Sensitivity to the weight matrix) Consider a scenario where the weight matrix W^(l) is perturbed by ΔW^(l). The effective weight matrix is W^(l) + ΔW^(l). Denoting the outputs of the unperturbed and perturbed HNF by ȳ^(L) and ŷ^(L), respectively, we can show that

‖ȳ^(L) − ŷ^(L)‖² ≤ ‖ΔW^(l)‖²_F ‖x‖².

Reduction of training cost
In this section, we analyze the effectiveness of the weight matrix V in the sense of reducing the training cost. We show that the proposed HNF provides a lower training cost as the number of layers increases. We also present how the proposed structure can be used to reduce the training cost of other learning methods that employ a linear projection to the target. Consider a dataset D = {(x, t)} containing N samples of pairwise P-dimensional input data x ∈ R^P and Q-dimensional target vectors t ∈ R^Q. Let us construct two single layer neural networks and compare the effectiveness of their feature vectors. In one network, we construct the feature vector as y = g(Wx), and in the other network, we build the feature vector ȳ = g(V_n Wx). We use the same input vector x, predetermined weight matrix W ∈ R^{n×P}, and ReLU activation function g(·) for both networks. However, in the second network, the effective weight matrix V_n W is fully predetermined. To predict the target, we use a linear projection of the feature vector. Let the predicted target for the first network be Oy, and the predicted target for the second network Ōȳ. Note that O ∈ R^{Q×n} and Ō ∈ R^{Q×2n}. By using ℓ2-norm regularization, we find optimal solutions for the following convex optimization problems.
C*₁ = min_O E‖t − Oy‖²  s.t.  ‖O‖²_F ≤ ε,  (11a)

C*₂ = min_Ō E‖t − Ōȳ‖²  s.t.  ‖Ō‖²_F ≤ ε,  (11b)

where the expectation operation is done by sample averaging over all N data points in the training dataset. The regularization parameter ε is the same for the two networks. By defining z ≜ Wx, we have

Oy = Og(z) = [O 0] g(V_n z) = [O 0] ȳ.

The above relation is due to the special structure of V_n and the use of the ReLU activation g(·). Note that the solution Ō = [O 0] exists in the feasible set of the minimization (11b), i.e., ‖[O 0]‖²_F = ‖O‖²_F ≤ ε, where 0 is a zero matrix of size Q × n. Therefore, we can show that the optimal costs of the two networks have the following relation:

C*₂ ≤ C*₁,

where the equality happens when Ō = [O 0]. Any other optimal solution Ō will lead to the inequality relation due to the convexity of the cost. Therefore, we can conclude that the feature vector ȳ of the second network is richer than the feature vector y of the first network in the sense of reduced training cost. The proposed structure provides an additional property for the feature vector ȳ, which we state in the following proposition. The proof idea of the proposition will be used in the next section to construct a multilayer structure, and therefore, we present the proof here.
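The zero-padding argument can be verified directly. The following sketch (our own illustration with arbitrary sizes) confirms that Ō = [O 0] reproduces the first network's predictions at the same Frobenius norm, which is why the optimal cost of (11b) cannot exceed that of (11a):

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)

P, n, Q, N = 5, 8, 3, 50
W = rng.standard_normal((n, P))
X = rng.standard_normal((P, N))                 # N training inputs as columns
Z = W @ X
Y = relu(Z)                                     # network 1 features: y = g(Wx)
Vn = np.vstack([np.eye(n), -np.eye(n)])
Ybar = relu(Vn @ Z)                             # network 2 features: g(V_n W x)

O = rng.standard_normal((Q, n))                 # any feasible O for (11a)
Obar = np.hstack([O, np.zeros((Q, n))])         # candidate [O 0] for (11b)
assert np.allclose(Obar @ Ybar, O @ Y)          # identical predictions
assert np.isclose(np.sum(Obar ** 2), np.sum(O ** 2))   # same Frobenius norm
```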
Proposition 3 For the feature vector ȳ = g(V n Wx), there exists an invertible mapping h(ȳ) = x when the weight matrix W is full-column rank.
Proof We now state the lossless flow property (LFP), as used in [17, 24]. A nonlinear function g(·) holds the LFP if there exist two linear transformations V and U such that Ug(Vz) = z, ∀z ∈ R^n. It is shown in [17] that ReLU holds the LFP; in particular, U_n = [I_n −I_n] satisfies U_n g(V_n z) = z for every z when g(·) is ReLU. Letting z = Wx, we can easily find x = W†z = W†U_n ȳ, where † denotes the pseudo-inverse, which acts as a left inverse when W is a full-column rank matrix. Therefore, the resulting inverse mapping h is linear.
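The LFP-based inversion is easy to verify numerically; a sketch of our own with U_n = [I_n −I_n], the pair matching V_n above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 8, 5
W, _ = np.linalg.qr(rng.standard_normal((n, m)))   # full-column-rank W
Vn = np.vstack([np.eye(n), -np.eye(n)])
Un = np.hstack([np.eye(n), -np.eye(n)])            # satisfies U_n g(V_n z) = z

x = rng.standard_normal(m)
ybar = np.maximum(Vn @ (W @ x), 0.0)               # ybar = g(V_n W x)
z_rec = Un @ ybar                                  # LFP recovers z = W x exactly
x_rec = np.linalg.pinv(W) @ z_rec                  # then x = W^dagger z
assert np.allclose(z_rec, W @ x)
assert np.allclose(x_rec, x)
```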

Reduction of training cost with depth
In this section, we show that the proposed HNF provides a lower training cost as the number of layers increases. Consider an L-layer feedforward network according to our proposed structure on the weight matrices:

ȳ^(L) = g(V_{n^(L)} W^(L) g(V_{n^(L−1)} W^(L−1) · · · g(V_{n^(1)} W^(1) x))).  (14)

Note that 2n^(L) is the number of neurons in the L-th layer of the network. The input-output relation in each layer is characterized by

ȳ^(1) = g(V_{n^(1)} W^(1) x),  (15a)

ȳ^(l) = g(V_{n^(l)} W^(l) ȳ^(l−1)), 2 ≤ l ≤ L,  (15b)

where W^(1) ∈ R^{n^(1)×P}, W^(l) ∈ R^{n^(l)×m^(l)}, and m^(l) = 2n^(l−1) for 2 ≤ l ≤ L. Let the predicted target using the l-th layer feature vector ȳ^(l) be O_l ȳ^(l). We find optimal solutions for the following convex optimization problems:

C*_{l−1} = min_{O_{l−1}} E‖t − O_{l−1} ȳ^(l−1)‖²  s.t.  ‖O_{l−1}‖²_F ≤ ε_{l−1},  (16a)

C*_l = min_{O_l} E‖t − O_l ȳ^(l)‖²  s.t.  ‖O_l‖²_F ≤ ε_l.  (16b)

Let us define z^(l) ≜ W^(l) ȳ^(l−1). Assuming that the weight matrices W^(l) are full-column rank, we can similarly derive ȳ^(l−1) = [W^(l)]† z^(l). By using Proposition 3, we have z^(l) = U_{n^(l)} ȳ^(l), and then we can write the following relation:

O_{l−1} ȳ^(l−1) = O_{l−1} [W^(l)]† U_{n^(l)} ȳ^(l),  (17)

where O_{l−1} denotes the optimal solution of (16a). By using (17), we can easily see that if O_{l−1}[W^(l)]† U_{n^(l)} belongs to the feasible set of the minimization (16b), we can guarantee that the optimal cost of the l-th layer is lower than or equal to that of the (l−1)-th layer. In particular, by choosing ε_l = ‖O_{l−1}[W^(l)]† U_{n^(l)}‖²_F, we can see that the optimal costs follow the relation

C*_l ≤ C*_{l−1},  (18)

where the equality happens when we have O_l = O_{l−1}[W^(l)]† U_{n^(l)}. Any other optimal solution O_l will lead to the inequality relation due to the convexity of the cost. Therefore, we can conclude that the feature vector ȳ^(l) of an l-layer network is richer than the feature vector ȳ^(l−1) of an (l−1)-layer network in the sense of reduced training cost. Note that if we choose the weight matrix W^(l) to be orthonormal, then

‖O_{l−1}[W^(l)]† U_{n^(l)}‖²_F ≤ ‖U_{n^(l)}‖²₂ ‖O_{l−1}[W^(l)]†‖²_F = 2 ‖O_{l−1}‖²_F ≤ 2 ε_{l−1},  (19)

where we have used the facts that ‖U_{n^(l)}‖²₂ = 2 and that ‖O_{l−1}[W^(l)]†‖_F = ‖O_{l−1}‖_F for orthonormal W^(l). Hence, a sufficient condition to guarantee the cost relation (18) is the relation between regularization parameters ε_l ≥ 2ε_{l−1}. We can choose ε_l = 2ε_{l−1} = 2^{l−1}ε_1. Note that the regularization parameter ε_1 in the first layer can also be determined analytically. Consider O_ls to be the solution of the following least-squares optimization:

C_ls = min_O E‖t − Ox‖².  (20)

Note that the above minimization has a closed-form solution. Similar to the argument in (18), by choosing ε_1 = ‖O_ls[W^(1)]† U_{n^(1)}‖²_F, it can be easily seen that

C*_1 ≤ C_ls,  (21)

where the equality happens only when O_1 = O_ls[W^(1)]† U_{n^(1)}. Similar to Proposition 3, we can prove the following proposition regarding the invertibility of the feature vector at the L-th layer of the proposed structure.
Proposition 4 For the feature vector ȳ^(L) in (14), there exists an invertible mapping function h(ȳ^(L)) = x when the weight matrices {W^(l)}_{l=1}^L are full-column rank.
Proof It can be proved by repeatedly using the lossless flow property (LFP) similar to Proposition 3.
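The carry-over argument behind the analytic choice of ε_l can be checked numerically. A sketch of ours for one layer, assuming a square orthonormal W^(l):

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda z: np.maximum(z, 0.0)

m, Q = 6, 3
W, _ = np.linalg.qr(rng.standard_normal((m, m)))   # W^(l), square orthonormal
Vn = np.vstack([np.eye(m), -np.eye(m)])
Un = np.hstack([np.eye(m), -np.eye(m)])

y_prev = relu(rng.standard_normal(m))              # stand-in for ybar^(l-1)
y_next = relu(Vn @ (W @ y_prev))                   # ybar^(l)

O_prev = rng.standard_normal((Q, m))               # optimal map of layer l-1
O_carry = O_prev @ np.linalg.pinv(W) @ Un          # carried-over map for layer l
assert np.allclose(O_carry @ y_next, O_prev @ y_prev)   # same predictions, Eq. (17)
eps_l = np.sum(O_carry ** 2)                       # analytic choice of eps_l
assert eps_l <= 2 * np.sum(O_prev ** 2) + 1e-9     # bound (19): ||U_n||_2^2 = 2
```

Because the carried-over solution is feasible under this choice of ε_l and reproduces the previous layer's predictions, the optimal cost of layer l can only stay equal or decrease.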

Reduction of training cost of ELM
Note that the feature vector ȳ^(1) in (15a) can be replaced by any feature vector that is used for linear projection to the target in another learning method. In Section 4.1, we assume ȳ^(1) to be the feature vector constructed from x using the matrix V, and therefore, the regularization parameter ε_1 is derived to guarantee a performance improvement compared to the least-squares method, as shown in (21). A potential extension would be to build the proposed HNF using the feature vector ȳ^(1) from other methods that employ linear projection to estimate the target. For example, the extreme learning machine (ELM) uses a linear projection of a nonlinear feature vector to predict the target [19]. In the following, we build the proposed HNF by employing the feature vector used in ELM to improve the performance.
Similar to Eq. (18), we can show that it is possible to improve the feature vector of ELM in the sense of training cost by using the proposed HNF. Consider ȳ^(1) = g(W^(1) x) to be the feature vector used in ELM for linear projection to the target. In the ELM framework, W^(1) ∈ R^{n^(1)×P} is an instance of a normal distribution, not necessarily full-column rank, and g(·) can be any activation function, not necessarily ReLU. The optimal mapping to the target in ELM is found by solving the following minimization problem:

C_elm = min_O E‖t − O ȳ^(1)‖².  (22)
Note that this minimization problem has a closed-form solution; we denote its solution by O_elm. We construct the feature vector in the second layer of the HNF as

ȳ^(2) = g(V_{n^(2)} W^(2) ȳ^(1)),  (23)

where W^(2) ∈ R^{n^(2)×m^(2)} and m^(2) = n^(1). The optimal mapping to the target by using this feature vector can be found by solving

C*_2 = min_{O_2} E‖t − O_2 ȳ^(2)‖²  s.t.  ‖O_2‖²_F ≤ ε_2,  (24)

where ε_2 is the regularization parameter. By choosing ε_2 = ‖O_elm[W^(2)]† U_{n^(2)}‖²_F, we can see that the optimal costs follow the relation

C*_2 ≤ C_elm,  (25)

where the equality happens when we have O_2 = O_elm[W^(2)]† U_{n^(2)}. Otherwise, the inequality has to follow. Similarly, we can continue to add more layers to improve the performance. Specifically, for the l-th layer of the HNF, we have ȳ^(l) = g(V_{n^(l)} W^(l) ȳ^(l−1)), and we can show that Eq. (18) holds here as well when the weight matrices {W^(l)}_{l=2}^L are full-column rank.
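A hedged end-to-end sketch of this construction, using unconstrained least squares (pseudo-inverse) in place of the constrained minimization (24) for simplicity, shows that the training cost of the ELM feature cannot increase after adding an HNF layer:

```python
import numpy as np

rng = np.random.default_rng(6)
relu = lambda z: np.maximum(z, 0.0)

P, n1, Q, N = 4, 10, 2, 40
X = rng.standard_normal((P, N))
T = rng.standard_normal((Q, N))

W1 = rng.standard_normal((n1, P))        # ELM: random first-layer weights
Y1 = relu(W1 @ X)                        # ELM feature ybar^(1)
O_elm = T @ np.linalg.pinv(Y1)           # closed-form least squares, as in (22)
cost_elm = np.sum((T - O_elm @ Y1) ** 2) / N

W2, _ = np.linalg.qr(rng.standard_normal((n1, n1)))   # full-column-rank W^(2)
Vn = np.vstack([np.eye(n1), -np.eye(n1)])
Un = np.hstack([np.eye(n1), -np.eye(n1)])
Y2 = relu(Vn @ (W2 @ Y1))                # second-layer HNF feature, as in (23)

O_carry = O_elm @ np.linalg.pinv(W2) @ Un
assert np.allclose(O_carry @ Y2, O_elm @ Y1)   # feasible point reproducing ELM

O2 = T @ np.linalg.pinv(Y2)              # refit on the richer feature
cost_2 = np.sum((T - O2 @ Y2) ** 2) / N
assert cost_2 <= cost_elm + 1e-9         # training cost cannot increase
```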

Practical considerations
The dimension of the feature vector ȳ^(l) increases as the number of layers increases. For a multilayer feedforward network, if we use an orthonormal matrix W^(l) for the l-th layer, then each layer produces a feature vector that has at least twice the dimension of the input feature vector. At the L-th layer, we get a dimension 2^L times the input data dimension. Note that ȳ = g(V_n z) is norm-preserving by Proposition 1, which means ‖ȳ‖² = ‖z‖². Using this principle successively, the full network is also norm-preserving, which means ‖ȳ^(L)‖² = ‖x‖². Therefore, as the number of layers increases, the amplitudes of the scalar entries of the feature vector ȳ^(L) diminish at the rate of 2^L. We show that the proposed HNF does not require a large number of layers to improve the performance. This also answers the natural question of whether many layers are practically required for an ANN. Note that since the dimension of the feature vector ȳ^(l) grows exponentially as 2^l, the proposed HNF is not suitable for cases where the input dimension is too large. One way to circumvent this issue is to employ the kernel trick [3] by using the feature vector ȳ^(l). We will address this solution in future works.
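The exponential growth can be made concrete with a short calculation (using the MNIST input dimension as an example; the numbers below are our own illustration):

```python
# feature dimension after L layers: each layer at least doubles the dimension
P, L = 784, 5                  # e.g., MNIST input dimension and L = 5 layers
dim = P
for _ in range(L):
    dim *= 2
assert dim == P * 2 ** L == 25088
```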

Results and discussion
In this section, we carry out experiments to validate the performance improvement and observe the effect of using the matrix V in the architecture of an HNF. We report our results for three popular datasets in the literature, as in Table 1. Note that we only choose datasets where the input dimension is not very large, due to the computational complexities. The Letter dataset [27] contains a 16-dimensional feature vector for each of the 26 English alphabet letters from A to Z. The Shuttle dataset [28] belongs to the STATLOG project and contains a 9-dimensional feature vector that deals with the positioning of radiators in the space shuttles. The MNIST dataset [29] contains gray-scale 28 × 28 pixel images of hand-written digits. Note that in all three datasets, the target vector t is a one-hot vector of dimension Q (the number of classes). The optimization method used for solving the minimization problem (16b) is the alternating direction method of multipliers (ADMM) [30]. The number of iterations of ADMM is set to 100 in all the simulations. We carry out two sets of experiments. First, we implement the proposed HNF with a fixed number of layers by using instances of random matrices for designing the weight matrix in every layer. In this setup, the weight matrix W^(l) ∈ R^{n^(l)×m^(l)} is an instance of a Gaussian distribution with appropriate size n^(l) ≥ m^(l) and entries drawn independently from N(0, 1) to ensure it is full-column rank. Second, we construct the proposed HNF by using the discrete cosine transform (DCT), as an example of a full-column rank weight matrix, instead of random matrices. In this scenario, we may need to apply zero-padding before the DCT to build the weight matrix W^(l) with the appropriate dimension. The step size in the ADMM algorithm is set accordingly in each of these experiments. Finally, we compare the performance and computational complexity of HNF and backpropagation over the same-size network.
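For illustration, the constrained least-squares subproblem (16b) can also be solved with projected gradient descent; the sketch below is our own simplified stand-in, not the ADMM solver [30] used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_constrained(Y, T, eps, iters=500):
    # minimize (1/N) ||T - O Y||_F^2  subject to  ||O||_F^2 <= eps,
    # via projected gradient descent (a stand-in for ADMM)
    N = T.shape[1]
    O = np.zeros((T.shape[0], Y.shape[0]))
    lr = N / (2.0 * np.linalg.norm(Y, 2) ** 2)      # 1/L step for the smooth part
    for _ in range(iters):
        G = -(2.0 / N) * (T - O @ Y) @ Y.T          # gradient of the cost
        O = O - lr * G
        nrm2 = np.sum(O ** 2)
        if nrm2 > eps:                              # project onto the Frobenius ball
            O *= np.sqrt(eps / nrm2)
    return O

Y = rng.standard_normal((6, 30))
T = rng.standard_normal((2, 30))
eps = 1.0
O = fit_constrained(Y, T, eps)
assert np.sum(O ** 2) <= eps + 1e-6                 # feasibility
assert np.sum((T - O @ Y) ** 2) <= np.sum(T ** 2)   # no worse than O = 0
```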

HNF using random matrix
In this subsection, we construct the proposed HNF by using instances of a Gaussian distribution to design the weight matrix W^(l). In particular, the entries of the weight matrix are drawn independently from N(0, 1). For simplicity, the number of nodes is chosen according to n^(l) = m^(l) for l ≥ 2 in all the experiments. The number of nodes in the first layer, 2n^(1), is chosen for each dataset individually such that it satisfies n^(1) ≥ P for every dataset with input dimension P, as reported in Table 1. The step size in the ADMM algorithm is set to 10^−7 in all the simulations in this subsection. We implement two different scenarios. First, we implement the proposed HNF with a fixed number of layers and show the performance improvement throughout the layers. In this setup, the only hyperparameter that needs to be chosen is the number of nodes in the first layer, 2n^(1). Note that the regularization parameter ε_1 is chosen such that it guarantees (21), and therefore eliminates the need for cross-validation in the first layer. Second, we build the proposed HNF by using the ELM feature vector in the first layer, as in (23), and show the performance improvement throughout the layers. In this setup, the only hyperparameter that needs to be chosen is the number of nodes in the first layer, n^(1), which is the number of nodes of the ELM, to be exact. It has been shown that ELM performs better as the number of hidden neurons increases [24]; therefore, we choose a sufficiently large number of hidden neurons to make sure that ELM is performing at its best. Note that the regularization parameter ε_1 is chosen such that it guarantees (25), and therefore eliminates the need for cross-validation. Finally, we present the classification performance of the corresponding state-of-the-art results in Table 1.
The performance results of the proposed HNF with L = 5 layers are reported in Table 1. We report test classification accuracy as the measure to evaluate performance. Note that the number of neurons 2n^(1) in the first layer of the HNF is chosen appropriately for each dataset such that it satisfies n^(1) ≥ P. For example, for the MNIST dataset, we set n^(1) = 1000 ≥ P = 784. The performance improvement in each layer of the HNF is given in Fig. 1, where train and test classification accuracy is shown versus the total number of nodes in the network, Σ_{l=1}^L 2n^(l). Note that the total number of nodes being zero corresponds to a direct mapping of the input x to the target using least squares according to (20). It can be seen that the proposed HNF provides a substantial improvement in performance with a small number of layers.
Fig. 1 Training and testing accuracy against the size of one instance of HNF. The size of an L-layer HNF is represented by the number of random matrix-based nodes, counted as Σ_{l=1}^L 2n^(l). Here, L = 5 for all three datasets. The number of nodes in the first layer (2n^(1)) is set according to Table 1 for each dataset.

The corresponding performance for the case of using the ELM feature vector in the first layer of the HNF is reported in Table 1. It can be seen that HNF provides a tangible improvement in performance compared to ELM. Note that the number of neurons in the first layer, n^(1), is, in fact, the same as the number of neurons used in ELM. We choose n^(1) to get the best performance for ELM in every dataset individually. The number of layers in the network is set to L = 3 to avoid increasing computational complexity. The performance improvement in each layer of the HNF in this case is given in Fig. 2, where train and test classification accuracy is shown versus the total number of nodes in the network, n^(1) + Σ_{l=2}^L 2n^(l). Note that the initial point corresponding to n^(1) is in fact equal to the ELM performance reported in Table 1, which is derived according to (22).
Finally, we compare the performance of the proposed HNF with the state-of-the-art performance for these three datasets. We can see that the proposed HNF provides competitive performance compared to the state-of-the-art results in the literature. It is worth mentioning that we have not used any pre-processing technique to improve the performance as in the state-of-the-art, but this can be done in future works.

HNF using DCT
In this subsection, we repeat the same experiments as in Section 5.1 by using the DCT instead of the Gaussian weight matrix. The number of nodes in each layer of the network is chosen as in Section 5.1. We apply zero-padding before the DCT in the first layer to build the weight matrix W^(1) ∈ R^{n^(1)×P} with the appropriate dimension for each dataset. Note that n^(l) = m^(l) for l ≥ 2 in all the experiments, and therefore, there is no need to apply zero-padding in the next layers. The step size in the ADMM algorithm is set to 10² in all the simulations in this subsection.
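A sketch of the DCT-based weight construction with zero-padding (our own illustration; the dimensions below are hypothetical):

```python
import numpy as np

def dct_matrix(n):
    # orthonormal DCT-II matrix; rows are cosine basis vectors
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)
    return C

def dct_weight(n, P):
    # W^(1) in R^{n x P}: zero-pad a P-dimensional input up to n, then apply the DCT
    pad = np.vstack([np.eye(P), np.zeros((n - P, P))])
    return dct_matrix(n) @ pad

n, P = 16, 9                 # e.g., Shuttle has P = 9 input features
W1 = dct_weight(n, P)
assert np.allclose(W1.T @ W1, np.eye(P))   # orthonormal columns: full column rank
assert np.allclose(dct_matrix(n) @ dct_matrix(n).T, np.eye(n))
```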
We implement the same two scenarios. First, we implement the proposed HNF using the DCT and show the performance improvement throughout the layers. Second, we build the proposed HNF using the ELM feature vector in the first layer and DCT matrices in the subsequent layers. Note that the regularization parameters $\lambda_l$ for $l \geq 2$ are chosen according to (19). The choice of $\lambda_1$ is such that it guarantees (21) and (25), according to each scenario.
The performance results of the proposed HNF using DCT matrices are reported in Table 2. Note that the number of neurons $n^{(1)}$ in the first layer and the number of layers are the same as in Table 1. The performance improvement in each layer of HNF is given in Figs. 3 and 4. It can be seen that, by using the DCT in the proposed HNF, it is also possible to improve the performance with a few layers.

Fig. 2 Training and testing accuracy against the size of one instance of HNF using the ELM feature vector in the first layer. The size of an L-layer HNF is represented by the number of random matrix-based nodes, counted as $n^{(1)} + \sum_{l=2}^{L} 2n^{(l)}$. Here, L = 3 for all three datasets. The number of nodes in the first layer ($n^{(1)}$) is set according to Table 1 for each dataset.

Fig. 3 Training and testing accuracy against the size of HNF using the DCT in every layer. The size of an L-layer HNF is represented by the number of DCT-based nodes, counted as $\sum_{l=1}^{L} 2n^{(l)}$. Here, L = 5 for all three datasets. The number of nodes in the first layer ($2n^{(1)}$) is set according to Table 2 for each dataset.
Finally, we compare the performance of the DCT-based HNF with that of the random matrix-based HNF, as shown in Tables 1 and 2. We can see that using the DCT as the weight matrix is as powerful as using random weights on these three datasets.

Computational complexity
Finally, we compare the test classification accuracy and computational complexity of HNF with those of backpropagation over the same learned HNF. We report the training time of each method in seconds. We run our experiments on a multiprocessor server with 256 GB RAM. The optimization method used for backpropagation is ADAM [31] from TensorFlow. The learning rate of ADAM is chosen via cross-validation, and the number of epochs is fixed to 1000 in all the experiments.
We construct HNF using random weights and the same number of layers and nodes as in Table 1. Note that we do not use the ELM feature vector in the first layer for this experiment, although it is possible to use it to improve the performance. The results are shown in Table 3. As expected, backpropagation can improve the performance, except for Shuttle, at the cost of a significantly higher computational complexity. HNF, on the other hand, does not require cross-validation and only performs training at the last layer of the network, leading to a much faster training. Note that the training time reported for backpropagation in Table 3 does not include cross-validation for the learning rate, so that we can have a fair comparison with HNF.

Fig. 4 Training and testing accuracy against the size of HNF using the ELM feature vector in the first layer and the DCT in the next layers. The size of an L-layer HNF is represented by the number of nodes, counted as $n^{(1)} + \sum_{l=2}^{L} 2n^{(l)}$. Here, L = 3 for all three datasets. The number of nodes in the first layer ($n^{(1)}$) is set according to

At this point, we also provide the reported classification performance of the scattering network on the MNIST dataset for the sake of completeness. The scattering network with principal component analysis (PCA) [14] over a modulus of windowed Fourier transforms yields 98.2% test classification accuracy for a spatial support equal to 8. This result shows that the scattering network can outperform HNF, at the cost of the higher complexity of using several scattering integrals in each layer. Note that HNF only uses a random draw from a Gaussian distribution as the weight matrix in each layer. Besides, the scattering network requires an accurate choice of several hyperparameters, such as the spatial support, the number of filter banks, and the type of the transforms, which can be crucial for the performance. For example, in our experiments, a scattering network with PCA over a modulus of 2D Morlet wavelets provides 94% accuracy, at best, for a spatial support of 28. The training on our server took 1158 s to reach this accuracy, which highlights the learning speed of HNF in Table 3. The same network with a spatial support of 14 gives a performance of 56.03%, showing the importance of a precise cross-validation.

Conclusion
We show that by using a combination of orthonormal matrices and ReLU activation functions, it is possible to guarantee a monotonically decreasing training cost as the number of layers increases. The proposed method can be used with any other loss function, such as the cross-entropy loss, as long as a linear projection is used after the ReLU activation function. Note that the same principle applies if, instead of random matrices, we use any other real orthonormal matrices; the discrete cosine transform (DCT), the Haar transform, and the Walsh-Hadamard transform are examples of this kind. The proposed HNF is a universal architecture in the sense that it can be applied to improve the performance of any other learning method that employs a linear projection to predict the target. The norm-preserving property and invertibility of the architecture make the proposed HNF suitable for other applications, such as auto-encoder design.
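The norm-preserving and invertibility claims can be checked numerically. The sketch below assumes the $[\mathbf{W}; -\mathbf{W}]$ layer form with an orthonormal $\mathbf{W}$ (an illustrative assumption consistent with the $2n^{(l)}$ node count), exploiting the identities $\mathrm{ReLU}(u)^2 + \mathrm{ReLU}(-u)^2 = u^2$ and $\mathrm{ReLU}(u) - \mathrm{ReLU}(-u) = u$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
# Random orthonormal W via QR decomposition of a Gaussian matrix
W, _ = np.linalg.qr(rng.standard_normal((n, n)))
x = rng.standard_normal(n)

u = W @ x
z = np.maximum(np.concatenate([u, -u]), 0.0)   # layer output with 2n nodes

# Norm preservation: ReLU(u_i)^2 + ReLU(-u_i)^2 = u_i^2 for every i,
# so ||z||^2 = ||u||^2 = ||x||^2 when W is orthonormal.
print(np.isclose(np.linalg.norm(z), np.linalg.norm(x)))   # True

# Invertibility: ReLU(u) - ReLU(-u) = u, hence x = W^T (z_top - z_bottom).
x_rec = W.T @ (z[:n] - z[n:])
print(np.allclose(x_rec, x))                               # True
```

The same recovery step can be chained backwards through the layers, which is what makes the feature vector invertible end to end.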
With calculations similar to (28), we can derive the relationships in Eqs. (29a) and (29b). Since the summation $\sum_{i \in \mathcal{M}_c(\mathbf{z}_1, \mathbf{z}_2)} z_1(i)\,z_2(i)$ is always non-positive, the inequality in (29a) holds, with equality when $\mathcal{M}_c = \emptyset$, that is, when the sign patterns of $\mathbf{z}_1$ and $\mathbf{z}_2$ match exactly. From (29b), it can also be seen that equality holds when $|\mathbf{z}_1| = |\mathbf{z}_2|$.

Table 1
Test classification accuracy of the proposed HNF for different datasets using random matrices

Table 2
Test classification accuracy of the proposed HNF for different datasets using DCT