Performance evaluation and analysis of distributed multi-agent optimization algorithms with sparsified directed communication

There has been significant interest in distributed optimization algorithms, motivated by applications in Big Data analytics, smart grid, vehicle networks, etc. While there have been extensive theory and theoretical advances, a proportionally small body of scientific literature focuses on numerical evaluation of the proposed methods in actual practical, parallel programming environments. This paper considers a general algorithmic framework of first and second order methods with sparsified communications and computations across worker nodes. The considered framework subsumes several existing methods. In addition, a novel method that utilizes unidirectional sparsified communications is introduced and theoretical convergence analysis is also provided. Namely, we prove R-linear convergence in the expected norm. A thorough empirical evaluation of the methods using Message Passing Interface (MPI) on a High Performance Computing (HPC) cluster is carried out and several useful insights and guidelines on the performance of algorithms and inherent communication-computational trade-offs in a realistic setting are derived.

there is a significant gap between theoretical studies of the methods and actual performance in practical infrastructures. For example, it is very hard to understand the relationship between iteration-wise convergence rate and actual communication and computational costs without empirical evaluation.
In this paper, a thorough and systematic empirical study of a class of distributed multi-agent optimization methods is carried out. All these methods are defined by different variants of sparsification of communications and/or computations along iterations. In more detail, we consider both first and second order methods that exhibit either unidirectional or bidirectional randomized sparsified communications. The considered sparsification strategies involve parameter p k that represents the probability that a node communicates at iteration k; the quantity p k is a design parameter that is either increasing, decreasing, or constant. These strategies give rise to a number of methods summarized in Table 1 1 . The studied class of methods subsumes several existing algorithms [12][13][14][15][16][17] but also includes several methods that have not been studied before, neither numerically nor analytically. Further contribution to the literature body of analytical results is the presented convergence analysis of the FUI method.
The main aims of the empirical evaluation are as follows: 1) to assess real benefits of sparsifying communications and/or computations across nodes, which have been proved to be beneficial theoretically [16]; 2) to compare different alternatives of the sparsification strategies; and 3) to compare unidirectional and bidirectional communication strategies. One of the main motivations for using sparsified, randomized communication is to reduce the amount of time spent on data exchange. The choice of omitting to communicate in some cases can also lead to savings in bandwidth or transmission power of wireless devices, when considering wireless networks. Using randomized communication at the level of algorithm design is a well established topic, where, e.g., gossip [18] is an outstanding example. It is also of interest to explore the case when communication sparsification is not fully determined by the algorithm designer, but instead is dictated by random link failures (e.g., packet dropouts in wireless networks).
The underlying implementation framework is the MPI (Message Passing Interface, [19]) running on an HPC computer cluster, which is a standard and widely adopted computational system.  [13,16] The rest of the paper is organized as follows. An overview of the work related to this topic is presented in Subsection 1.1. We briefly describe the optimization model and the algorithmic framework for the Distributed Quasi Newton (DQN) method in Subsection 2.1. The algorithmic framework is described in Subsection 2.2. Convergence analysis of the novel FUI method that uses unidirectional communication is presented in Subsection 2.3. Implementation and infrastructure are described in Subsection 2.4. The simulation setup is described in Subsection 2.5 and the proposed set of methods that fit the introduced algorithm is presented in Subsection 2.6. The results are highlighted in Section 3, organized in the following manner: Subsection 3.1 contains an analysis of different graph types; Subsection 3.2 is dedicated to the analysis of scaling properties of the algorithm; Subsection 3.3 contains an analysis and comparison of execution time, regarding all the considered methods; an analysis and discussion on the effects of communication sparsification is given in Subsections 3.4 and 3.5 is dedicated to the comparison of Algorithm 1 to ADMM (Alternating Direction Method of Multipliers, see [11]). Finally, the conclusions are made in Section 4.

Related work
There has been a large body of literature on theoretical development of distributed optimization methods. A proportionally much smaller body of scientific literature focuses on testing and evaluation of the methods over actual distributed infrastructures. Existing studies include, e.g., [20], for the dual averaging method, and [11] for the alternating direction method of multipliers. Distributed convex optimization by alternating direction method of multipliers is studied in [11]. A stochastic, efficient quasi-Newton method, using the BFGS update formula, is introduced in [21] in order to take advantage of the curvature information. A fast distributed proximal gradient method is proposed in [22]. An incremental sub-gradient approach, suitable for distributed optimization in networked systems, is presented in [23]. An important aspect in evaluation in distributed optimization is the nature of the network of nodes itself. The effects of this aspect are highlighted in [24].
More recently, there have been works that include MPI-based empirical studies of the methods. In [25] an asynchronous subgradient-push method is proposed and its performance is evaluated on an MPI cluster, whereas in [26] an empirical comparison of several distributed first order methods is given. An exact asynchronous method and its performance analysis using an MPI cluster are presented in [27]. A theoretical and empirical study of communication and computational trade-offs for the distributed dual averaging method is given in [20]. Finally, the focus of [28] is on the distributed dual averaging method with several useful guidelines about practical design and performance of the methods.
With respect to existing studies, this paper differs along several lines. First, it considers a different class of methods with respect to existing empirical studies, as the considered methods include various strategies for communication sparsification (see [12][13][14][15][16][17]). Second, it provides a novel insights into comparison among different sparsification strategies, as well as the practical benefits with respect to the corresponding always-communicating benchmark. The empirical results show that communication sparsification can lead to significant execution time reductions. To the best of our knowledge, this is the first empirical evaluation reported on the class of algorithms with sparsified communications presented in [16]. Also, a theoretical convergence analysis of the FUI method is carried out in this paper. While [14] also considers unidirectional communications, it studies the specific problem of distributed estimation, which translates into quadratic objective functions and stochastic gradient updates. In contrast, our analysis considers generic strongly convex costs. An important aspect of the framework considered in this paper is that it includes both first and second order methods.

Optimization and network models
Consider a connected network of n nodes, where each node has access to a convex cost function f i : IR s → IR, and assume that f i is known only by the node i. The goal is to solve the following unconstrained optimization problem With problem (1) a graph G = (N, E) can be associated, where N = {1, ..., n} is the set of nodes, and E is the set of edges {i, j}, i.e., pairs of nodes i and j that can directly communicate.
As it will be seen, graph G represents a collection of realizable communication links; actual algorithms that are considered here may utilize subsets of these links over iterations in possibly unidirectional, sparsified communications.
The assumption is that G is connected, undirected and simple (no self nor multiple links). Denote by i the neighborhood set of node i and associate an n × n symmetric, doubly stochastic matrix W with graph G. The matrix W has positive diagonal entries and respects the sparsity pattern of graph G, i.e., for i = j, W ij = 0 if and only if {i, j} / ∈ E.
On the other hand, it is important to note, that in the cases of unidirectional communication between the nodes, the graph instantiations over iterations (subgraphs of G) can be directed. Also, assume that W ii > 0, ∀i. It can be shown that λ 1 (W ) = 1, andλ 2 (W ) < 1, where λ 1 (W ) is the largest eigenvalue of W, andλ 2 (W ) is the modulus of the eigenvalue of W that is second largest in modulus. Denote by λ n (W ) the smallest eigenvalue of W. There also holds |λ n (W )| < 1.
The following optimization problem can be associated with (1), where F(x) := n i=1 f i (x i ) and I is the identity matrix. Moreover, I−W is positive semidefinite, so (I − W )x = 0 is equivalent to (I − W ) 1/2 x = 0. Therefore, (3) can be replaced by In other words, the constraint Wx = x enforces that all the feasible x i 's in optimization problem (3) are mutually equal, thus ensuring the equivalence of (1) and (3) and the equivalence of (1) and (4). Further, a penalty reformulation of (4) can be stated as where 1 α is the penalty parameter. Therefore (5) represents a quadratic penalty reformulation of the original problem (1). After standard manipulations with the penalty part we obtain which is the same as (2) for s = 1. These considerations are easily generalized for s > 1.
It is well known, [1], that the solutions of (1) and (2) are mutually close. More specifically, for each i = 1, ..., n, is the solution to (2). In more details, Theorem 4 in [29] says that under strongly convex local costs f i 's with Lipschitz continuous gradients (see ahead Assumption 2.1 for details), the following holds, for all i = 1, ..., n: Here, x i is the minimizer of f i , L is the Lipschitz constant of the gradients of the f i 's, and μ is the strong convexity constant of the f i 's. The usefulness of formulation (2) is that it offers a solution that is close (on the order O(α)) to the desired solution of (1), while, unlike formulation (1), it is readily amenable for distributed implementation. A key insight known in the literature (see, e.g. [4,30]) is that applying a conventional (centralized) gradient descent method on (2) precisely recovers the distributed gradient method proposed in [1]. In other words, it has been shown that the distributed method in [1] -that approximately solves (1) -actually converges to the solution of (2). This insight has been significantly explored in the literature to derive several distributed methods, e.g., [4,5,16]. The class of methods considered in this paper also exploits this insight and therefore harnesses formulation (2) to carry out convergence analysis of the considered methods.

Algorithmic framework
The algorithmic framework is presented in this Section. The framework subsumes several existing methods [12][13][14][15][16][17], and it also includes a new method that will be analysed in this paper.
Within the considered framework, each node i in the network maintains x k i ∈ R s , its approximate solution to (1), where k is the iteration counter. In addition, let us associate a Bernoulli random variable z k i to each node i, that governs its communication activity at iteration k. If z k i = 1, node i communicates; if z k i = 0, node i does not exchange messages with neighbors. When z k i = 1, node i transmits x k i to all its neighbours j ∈ i , and it receives x k j , from all its active (transmitting) neighbours. The intuition behind the introduction of quantities z k i is the following. It has been demonstrated (see, e.g., [12]) that distributed methods to solve (1) and (2) exhibit certain "redundancy" in terms of the utilized communications. In other words, it is not necessary to activate all communication channels at all times for the algorithm to be convergent. Moreover, communication sparsification may lead to convergence speed improvements in terms of communication cost [12]. Communication sparsification and introduction of the z k i 's leads to less expensive but inexact algorithmic updates. A proper design of the z k i 's can lead to a positive resolution of the inexact-less expensive updates tradeoff; see, e.g., [12] for details.
Assume that the random variables z k i are independent both across nodes and across iterations. Denote by p k = Prob z k i = 1 , assumed equal across all nodes. The quantity p k is a design parameter of the method; strategies for setting p k are discussed further ahead. Intuitively, a large p k corresponds to "less inexact" updates but also to lower communication savings. With the considered algorithmic framework, solution estimate update at node i is as follows: x k+1 Here, α is a positive parameter, known as the step-size. The values of α differ depending on the input data (See ahead Section 2.5). Further, ξ k i,j is in general a function of z k i and z k j that encodes communication sparsification; and M k i is a local second order informationcapturing matrix, i.e., the Hessian approximation.
The following choices of the quantities ξ k i,j and M k i will be considered: 1) ξ k i,j = 1: no communication sparsification; 2) ξ k i,j = z k i · z k j bidirectional communication sparsification (that is, node i includes node j's solution estimate in its update only if both i and j are active in terms of communications); and 3) ξ k i,j = z k j (unidirectional communication); that is, node i includes node j's solution estimate in its update whenever node j transmits, irrespective of node i being transmission-active or not.
Regarding the matrix M k i , two options can be considered. First, M k i = I and this corresponds to first order methods, where one has no second order information included. Second option is M k i = D k i , where: This corresponds to the second order methods of DQN-type [16] (See ahead Section 2.6). We now provide intuition behind the generic method (9)-(10) and the choices of ξ k i,j 's and M k i 's. The method (9)-(10) corresponds to an inexact first order or an inexact second order method to solve (2) -and hence to approximately solve (1). The main source of inexactness is due to the sparsification (ξ k i,j 's). The bidirectional communication ξ k i,j = z k i · z k j is appealing as it preserves symmetry in the underlying weight matrix, which is known to be a beneficial theoretical property. On the other hand, the bidirectional sparsification is also wasteful in that a node ignores the received message from a neighbor if its own transmission to the same neighbor is not successful (see formula (9)). With respect to the choice first versus second order method (the choice of M k i ), the second order choice is computationally more expensive per iteration due to the Hessian computations; on the other hand, it can improve convergence speed iteration-wise.
The pseudocode for the general algorithmic framework is in Algorithm 1. A summary of all the considered methods within the framework of Algorithm 1 is given in Table 1.

Algorithm 1
Pseudocode for the proposed algorithmic framework Require: at each node i: repeat Each node i generates z k i and computes: (9) -(10) until a stopping criterion is met

Convergence analysis
In this section, a convergence analysis of the algorithm variant with unidirectional communications is carried out (See ahead Method FUI in Section 2.6). More precisely, in this section we assume the following choice of M k i and ξ k ij : To the best of our knowledge, except for a different estimation setting [14], this algorithm has not been studied before. The following assumptions are needed.
Notice that Assumption 2.1 (c) implies that α < (1 + λ n (W ))/L, which is equivalent to Let (a) Assume that the sequence {p k } converges to one as k → ∞. Then, the sequence of iterates x k converges to x • in the expected error norm, i.e., there holds: (b) Assume that the sequence {p k } converges to one geometrically as k → ∞, i.e., p k = 1 − δ k+1 , for all k, Then, there holds: where γ < 1 is a positive constant.
(c) Assume that p k ≥ p min for all k and for some p min ∈ (0, 1) and that the iterative sequence x k is uniformly bounded, i.e., there exists a constant C 1 > 0 such that E x k ≤ C 1 , for all k. Then, there holds: where C 2 = 2nC 1 αμ and θ ∈ (0, 1). Theorem 2.1 demonstrates that the Algorithm 1 with sparsified and unidirectional communications converges. More precisely, as long as the sequence p k converges to one, even arbitrarily slowly' , the sequence x k converges to the solution of (2) in the expected error norm sense. When the convergence of p k to one is geometric, we have that x k converges geometrically, i.e., at a linear rate. Finally, when p k stays bounded away from one, under the additional assumption that the sequence x k is uniformly bounded, the algorithm converges to a neighbourhood of the solution to (2), where the neighbourhood size is controlled by parameter p min (the closer p min to one, the smaller the error). This complements the existing results in [16] which concerns bidirectional communications.
Next, the proof of Theorem 2.1 will be carried out. To avoid notation clutter, let the dimension of the original problem (1) be s = 1. The proof relies on the fact that the method can be written as an inexact gradient method for minimization of . More specifically, it can be shown that the algorithm determined by (9) -(12) is equivalent to the following: where e k = e k 1 , ..., e k n T is given by and e k ∈ R n . Indeed, in view of (12), method (9)-(10) can be represented as where [ Thus, Therefore, for each component i, the error is determined by and (19) follows. Next we state and prove an important result. Here and throughout the paper, || · || denotes the vector 2-norm and the corresponding matrix norm.
Proof. Using (18) and the fact that ∇ (x • ) = 0 we obtain Further, there exists a symmetric positive definite matrix B k such that and its spectrum belongs to [μ, L ]. Thus, we obtain Notice that the Assumption 2.1 (c) implies that θ < 1 since (14) holds and L ≥ μ. Moreover, putting together (26) -(28), we obtain and applying the induction argument we obtain the desired result.
To complete the proof of parts (a) and (b) of Theorem 2.1, we need to derive an upper bound for ||e k || in the expected-norm sense. In order to do so, it is needed to establish the boundedness of iterates x k in the expected norm sense. Proof. The update rule (20) can be written equivalently as follows Introduce W k = W k − W , and rewrite (30) as Denote by x the minimizer of F. Then, by the Mean Value Theorem, there holds and Note that ||H k || ≤ L, by Assumption 2.1. Also, note that ||W − αH k || ≤ 1 − αμ, for α ≤ 1 2L . Therefore, the following can be stated Next, || W k || will be upper bounded. Note that Therefore, Taking expectation and using the fact that E z k j = p k , for all k, it can be concluded that for some positive constant C. Now, using independence of W k and x k , the following can be obtained from (34), As for the constant C e = 2n α (1 − p min ) C 1 . Proof. The proof follows straightforwardly from (19) and Lemma 2.3. Consider (24). Then, |e k i | can be upper bounded as follows: This yields: Taking expectation while using independence of z k j and x k , and using E x k ≤ C 1 ; Next, applying Lemma 3.1 in [31], it follows that as we wanted to prove. Let us now consider the part (b). Note that, in this case, we have that 1 − p k = δ k+1 , for all k. Specializing the bound in (43) to this choice of p k , the following holds and using the fact that s k := k t=1 θ k−t δ t converges to zero R-linearly (see Lemma II.1 from [16]), we obtain the result.
Finally, we prove part (c). Here, we upper bound the term (1 − p t−1 ) in (43) with (1 − p min ). For this case we obtain which completes the proof of part (c).

Implementation and infrastructure
A parallel implementation of Algorithm 1 was developed, using MPI [19]. The testing was performed on the AXIOM computing facility consisting of 16 nodes (8 x Intel i7 5820k 3.3GHz and 8 x Intel i7 8700 3.2GHz CPU -96 cores and 16GB DDR4 RAM/node) interconnected by a 10 Gbps network. Network configurations of grid and regular graphs are taken into consideration for graph G. A set of tests is conducted for the same data set with the same number of nodes for both types of graphsd-regular graphs and grid. The input data for the algorithm are read from binary files by the master process. The master process then scatters the data to other processes in equal pieces. If the data size is not divisible by the number of processes, then the remaining data is assigned to the master process. Therefore, the data are in the memory during computation and there is no Input/Output (I/O) operation performed while executing the algorithm.
The communication between the nodes is realized by creating a set of communicators -one for each node. The i-th communicator contains the i-th node as the master, and the nodes that are its neighbors. When sparsifying the communication between the nodes, the communicators should be recreated across the iterations, in order to ensure that only active nodes can send their results, see [11]. When using bidirectional communications, an active node is being included into its own communicator and into the communicators of its active neighbours. An inactive node is not included in the communicators of its neighbors, and also does not need its own communicator at the current iteration. In the case of unidirectional communication, an inactive node is included in its own communicator, but not in the communicators of its neighbors.
The data distribution process does not consume a large amount of the execution time. For example, considering a data set that contains a matrix of 5000 × 6000 elements and a vector of 5000 elements, the initial setup, including reading and scattering the data, as well as the creation of the communicators, takes about 0.3s per process. When compared to the overall run-time of the tests it represents a relatively small percentage. Regarding the case with the lowest execution time this percentage is 5%. On the other hand it is only 0.0007%, in the case with the highest execution time.
Regarding the stopping criteria, we let the algorithms run until ||∇ x k || ≤ , where = 0.01. Note that the gradient ∇ x k is not computable by any node in a distributed graph G in general. In our implementation ∇ x k is maintained by the master node.
While not being a realistic stopping criterion in a fully distributed setting, it allows us to adequately compare different algorithmic strategies, The implementation relies on efficient LAPACK [32] and BLAS [33] linear algebra operations, applied on the nodes, while performing local calculations.

Simulation setup
The tests were performed on two types of graphs: d-regular and grid graphs with different number of nodes. We constructed the d-regular graphs in the following way. For 8-regular graphs, for each number of nodes n, we construct an 8-regular graph starting from a ring graph with nodes {1, 2, ..., n} and then adding to each node i the links to the nodes i − 4, i−3, i−2, and i+2, i+3, and i+4, where the subtractions and additions here are modulo n. The same principle was also used for 4-regular and 16-regular graphs used in this paper.
The tests are performed for the logistic loss functions given by Here, x = x T 1 , x 0 ∈ R s−1 × R represents the optimization variable and τ is the penalty parameter. The input values are a i ∈ R s−1 and b i ∈ R.
The testing is performed on different versions of Algorithm 1 with sparsified communication, for both bidirectional and unidirectional communication strategies (see ahead Table 1). The input data are represented as an r × (s − 1) sized matrix of features, and an r sized vector of labels. Both the matrix and the vector are then divided into n parts corresponding to the nodes as explained in the previous section. We then vary n (and the corresponding graph G) and investigate the performance of Algorithm 1.
The following data sets were used for testing.
• The Conll data set [34,35], that concerns language-independent named entity recognition. It has r = 220663 and s = 20 as the input data sizes. This data set is only used for comparing the performance of the algorithm between regular and grid graphs. • The Gisette data set [36][37][38], known as a handwritten digit recognition problem. Its input data sizes are r = 6000 and s = 5001. The data set is used for testing the different alternatives of the algorithm as well as for determining the most suitable value of d for d -regular graphs.
• The YearPredictionMSD train data set is used to predict the release year of a song from audio features [37,39,40]. Here r and s are r = 463715 and s = 91. The data set is also used for testing the different alternatives of the algorithm.
• The MNIST data set represents a database of handwritten digits [41,42], with input data sizes r = 60000 and s = 785. This data set is also used for testing the different alternatives of the algorithm.
• The Relative location of CT slices on axial axis data set (referred to as CT data set further on), containing features extracted from CT images [37,43,44]. The data sizes are r = 53500 and s = 386. This data set is also used for testing the different alternatives of the algorithm.
• The p53 Mutants data set [37,[45][46][47][48] (referred to as p53 data set further on) is used for modelling mutant p53 transcriptional activity (active or inactive) based on data extracted from biophysical simulations. The data set sizes are r = 31159 and s = 5410. The data set is also used for testing the different alternatives of the algorithm.
The parameters for Algorithm 1 are set according to the experimentally obtained conclusions. The value α can be defined as α = 1 KL , where L is the Lipschitz gradient constant and K ∈[ 10, 100], as proposed in [5]. The value of α can be fine-tuned according to the data set used for the tests. Increasing this value can lead to faster convergence. However, if the value is too large, then the algorithm might converge to a coarse solution neighbourhood. The values of α used for the mentioned 5 data sets are obtained experimentally and are listed below: • α = 0.0001 for the Gisette data set; • α = 0.001 for the p53 data set; • α = 0.1 for the YearPredictionMSD, MNIST, Conll and CT data sets.
A larger value of α = 0.1 can be applied in the cases of relatively small number of features, compared to the number of instances (i.e. rows of data). Here, in all the 4 cases for α = 0.1, the number of features is smaller than 1000.
The probability of communication p k is set as follows: p k = 1 − 0.5 k , where k is the iteration counter, or as p k = (k + 1) −1 . In other words, we consider an increasing and a decreasing sequence for the p k 's. The decreasing sequence for the probability is of interest  Table 1 lists the different methods as alternatives of Algorithm 1, considering the solution update, defined in (9), (10) and (11). The naming convention for the methods was already described in the introductory section (see page 2).

Description of the methods
Method SBC represents the initial version of the algorithm, used as the benchmark here, where Method FBC is its first order equivalent. These methods does not utilize any communication sparsification.
Note that Methods FBI, FBD, FUI, FUD, SBI, SBD, SUI, SUD use sparsification with either increasing or decreasing communication probabilities p k . The rationale for choosing a linearly increasing p k and a sub-linearly decreasing p k is adopted according to insights available in the literature; see, e.g., [16], [14]. While it is possible to consider other choices and fine-tuning of the sequence p k , this topic is outside of the paper's scope. Our primary aim is to investigate the feasibility and performance of increasing and of decreasing sequence of p k 's relative to the always-communicating strategy (Method SBC and Method FBC), as well as relative to the unidirectional versus bidirectional communication, and the first order versus second order methods.
The convergence analysis for the novel method with unidirectional communication Method FUI is presented here, where Methods SUI and SUD, that also rely on unidirectional communication, remain open for theoretical analysis. The Methods FBI, FBD, SBI and SBD, using bidirectional communication are already analysed in the literature (see [12][13][14][15][16][17]).
The listed methods and data sets described before are used to derive some empirical conclusions. As expected, the analysis of obtained results provides some insights about the optimal number of nodes for different setups. Also, the advantages of particular methods are clearly visible and one can estimate the usefulness of sparsification based on these results, keeping in mind that the tests might be influenced by the selection of data sets. Nevertheless, we believe that the obtained insights are useful.

Results and Discussion
We now present the experimental results. First, we investigate the behaviour of the Algorithm 1 for two types of graphsd-regular graphs and grid graphs. After that, we perform a sequence of tests using all the methods and the data sets stated above on d-regular graphs. These test are used to gain insight into effectiveness of different sparsification alternatives and differences between the first and second order methods in the framework of Algorithm 1.

Analysis of different graph types
The tests on different graph types are performed using the data sets Conll and CT with Method SBC. Figures 1 and 2 represent a performance comparison between the executions of the algorithm using different d-regular and grid graphs with Method SBC on CT and Conll data set, respectively. Observing Fig. 1, it can be clearly concluded that d-regular graphs perform better than grid graphs, which becomes more evident with increasing number of nodes. However, dregular graphs perform similarly on this data set for different values of d. The execution times for d = 4 and d = 8 are almost the same here. Therefore, it is important to examine the performance for different graphs on another data set. From Fig. 2, it is evident that the execution time decreases until the optimal number of nodes is reached, and starts to grow after that point. The same trend is present in Fig. 1, but the optimal number of nodes is higher here. Figure 2 clearly shows the difference between d-regular and grid graphs. It also identifies 8-regular graphs as the most suitable choice for different number of nodes. Therefore, in the rest of the paper we consider 8-regular graphs, based on the derived empirical conclusions. For the cases, where the number of nodes n is smaller than 8, the value d = n − 1 is used, leading to all-to-all graphs for n < 8.

Analysis of scaling properties
A sequence of tests with different number of computational nodes n is performed next to give an insight into the most suitable number of nodes for the data sets. Figures 3 and  4 represent examples of the scaling properties of the algorithm, for Method FBI on the YearPredictionMSD data set and for Method FUI on the MNIST data set, respectively. Here, when varying n we keep the graph structure to the 8-regular graph. The optimal number of nodes can be identified in both cases. These graphs obviously show the usual expected trend where the execution time decreases until the optimal number of nodes is reached, while after that further enhancement in number of nodes leads to time increase. Intuitively, the larger number of workers n means that the same overall workload is parallelized over more workers, leading to time reduction. However, the beneficial effect is lost for sufficiently large n when the communication overhead time starts to dominate. Interestingly, the optimal number of nodes is mostly constant for the first order methods as well as for the second order methods, irrespective of the data set. Table 2 shows the percentages of successful tests for all methods, i.e., of tests that satisfy the stopping criteria x k < 0.01 within the maximal execution time of 15 hours. In the failed tests the iterations are also approaching the solution, but they did not reach  Table 3 lists the execution time for each of the 10 methods, for the p53 data set and 20 nodes. The maximal execution time, i.e. the time for the slowest process, is taken into account for all the cases. As this amount of time can vary on different processes, all processes are waiting for the slowest one in the communicator in order to successfully exchange the data. All first order methods introduce significant execution time reduction. In this case, Method FBD has the best performance. When comparing Method FBC to Method SBC, it is clear that the computation of second order direction d k i significantly increases the execution time. Reducing the amount of communication across the iterations with Method FBD leads to even faster execution here. However, this behaviour may be highly dependent on the nature of the data set. The algorithms for p53 data set converge fast, within relatively small number of iterations. An equally important aspect here is also the fact that Method FUD, using unidirectional communication and decreasing communication probability performs better than Method FBI, with bidirectional and  As the nature of the data can highly influence the results, let us consider another example. Table 4 also contains the execution time for each of the 10 algorithms with 12 nodes for the MNIST data set (Method SUD does not converge for the given execution time limit). The behaviour of this data set differs from the p53 data set, observed in Table 3. For example, for 12 nodes Method FBD requires 4795 iterations to converge for the MNIST data set. When considering the p53 data set for the same setup with 12 nodes, it converges after only 3 iterations. However, the conclusions based on Table 4 are very similar to those from Table 3. In fact, it seems that the properties of particular methods are similar as long as the data sets are of similar volume. Figures 5 and 6 represent the execution times for first order methods with communication sparsification, i.e. Methods FBI, FBD, FUI, FUD for the CT and Gisette data set, respectively. From Fig. 5, it can be concluded that the optimal number of nodes for Methods FBI, FUI and FUD, is the same value n = 6. However, Method FBD performs differently. It shows lower execution time values generally, and its optimal number of nodes is n = 10. Similar conclusions could be made based on Fig. 6. Here, the optimal number of nodes for Methods FBI, FUI and FUD is again the same, n = 8. Method FBD also performs differently here, with lower execution time values, compared to other first order methods. The optimal number of nodes for the second order methods tends to be a larger  number. This is a direct consequence of the fact that the time consuming computations for the direction are faster with smaller portions of data on a node. Figure 7 represents the average cost reduction for different number of nodes, compared to the method of the weakest performances for each data set, where the average is taken across different data sets. These tests were performed for first order methods with communication sparsification, i.e., Methods FBI, FBD, FUI and FUD. For each data set, we divide the execution time for a given number of nodes with the worst execution time on the same data set, and compute the average value over methods for all the data sets, for different numbers on nodes. The conclusions based on this figure are consistent with the ones in Figs. 5 and 6. Method FBD has the best performance properties. Also, for each method, an optimal number of nodes can be identified. An evaluation of the algorithm execution with different sequences {p k } that stay bounded away from one as k grows large is presented in Fig. 8. The unidirectional, first order method was tested on the Conll data set, using α = 0.1. We observed the value of as in (2) during the execution of the algorithm. The value of decreases over time for all choices of p k , as expected. The zoomed part of the figure is included in order to present the last few seconds of the execution before reaching the minimal values of . Figure 8 shows that for different values of p k the iterative sequences do not converge to the same value, but also that for the constant p k choices the obtained limiting values are close.   The performance profile for a given β of the method M i is then calculated as the number of points divided by the number of the performed tests. For example, on the y-axis where the parameter β = 1, we obtain the statistical probability that the method is the best one among all the tested methods in terms of the execution time. It is noticeable that the value range on the x axis is large, on these figures. This is due to the fact that there are very large differences in execution times, ranging from a few seconds to values larger that 18000 seconds. Figure 9 shows the performance profile for all the tests on all data sets for the 10 methods, where Figs. 10 and 11 display the performance profile for first and second order methods, respectively. Figures 9 and 10 identify Method FBD as the best choice within the framework for Algorithm 1. Observing the methods without sparsification, i.e. Methods SBC and FBC, Fig. 9 indicates that the first order method, Method FBC, performs better than the second order method, Method SBC. The same is true if we consider the methods with sparsification. Considering methods with decreasing communication probability and using bidirectional communication, Method FBD performs clearly much better than Method SBD. When comparing the other first and second order methods using the same sparsification (Method FBI and Method SBI, Method FUI and Method SUI, Method FUD and Method SUD), first order methods performs better in 61% of test cases. Also, the convergence rate for first order methods is higher (See Table 2). It can also be concluded that the sparsification of second order methods gives no advantages probably because the computation of the second order direction is time consuming. Furthermore, with communication sparsification the second order information is incorporated only partially and hence it does not provide enough advantage to compensate for computational load. On the other hand, communication sparsification can be beneficial for the first order methods, as evidenced by Method FBD. Generally, the best performing method is a first order method using the appropriate sparsification (bidirectional with decreasing communication probability), Method FBD. Figure 12 represents the performance profile for the tests on the Gisette data set. Here, Method FBD can be also identified as the most suitable, followed by Method FBC, and later by Method FUI, Method FBI and Method FUD, where the second order methods show poorer performance profiles. The dimension s for this data set is a large value s = 5001, resulting with time consuming calculations in the second order methods as the Hessian approximation matrices are of large dimensions. Therefore, the first order methods perform better than second order methods. Figure 14 displays the performance profile for the tests on the p53 data set. The conclusions for this data set, are very similar to those for Fig. 12. Similarly, the dimension s is also a larger value here, s = 5410, so the first order methods also performs better than the second order methods and again, Method FBD represents the best choice. Similar conclusions are emerging from Fig. 13, that represents the performance profile for the MNIST data set. The dimension s = 785 is around 6 times smaller here, compared to Gisette and p53 data sets, but the dimension r = 60000 is 10 times larger than for Gisette, and 2 times larger than for p53. This results with similar load when distributing the data and calculation of the second order direction is too costly again. The performance profile for the CT data set is displayed on Fig. 15. Here, the second order method Method SUI dominates, as the data set dimension s = 386 enables faster calculations of the second order direction. Comparison between the first and second order methods with the same communication sparsification yields the following conclusion -with the increasing communication probability the second order methods (Methods SBI and SUI) perform better (for both unidirectional and bidirectional communication). With the decreasing communication probability the first order methods (Methods FBD and FUD) give better results.    Figure 16 represents the performance profile for the YearPredictionMSD data set. Here, the dimension s = 91 is the smallest among the observed data sets. Therefore the second order methods performs better. But the sparsification does not improve the first order nor the second order methods for these data. This fact might be explained by the large dimension r = 463715, and therefore each node gets a large subset. Sparsifying the communication means ignoring a large portion of data on idle nodes, even if there is only one idle node. Thus, the gradient and Hessian are poorly approximated with idling.

Comparison of Algorithm 1 to ADMM
As problem (1) can be solved using the Alternating Direction Method of Multipliers (ADMM) [11], we compared Algorithm 1 to an ADMM implementation for logistic regression [50], on the Conll data set. More precisely, the method in [11] solves problem   (1) assuming the presence of a central node that communicates to all other nodes in the network. Henceforth, we adapt our algorithmic framework to the latter setting by letting the underlying network G to be fully connected and by setting the matrix W to have all its entries equal 1/n. The comparison between the second order Methods SBC and SBI and ADMM is shown in Table 5. We calculate the value of k = 1 n n i=1 f x k i , i.e., the average global cost in (1) averaged across all nodes' estimates, at the end of each iteration and we also measure the execution time. The second column in Table 5 represents the time required to satisfy the condition k −f * f * < 0.1. Here, f is numerically evaluated by ADMM. The rationale for this comparison is the following. All the methods converge to a neighborhood of the solution to (1), while ADMM converges to the exact solution of (1). Therefore, it is meaningful to compare the times that each method needs to reach a certain accuracy level, measured with respect to the cost function in (1). We tested all the   Performance profile for all 10 methods, for the tests performed on the CT data set methods and finally included the results for the best performing second order methods, i.e. Methods SBC and SBI. More precisely, Method SBI (a second order method with sparsification) is here the best performing method across all methods, while Method SBC is taken as the baseline (second order) method without sparsification. The fact that second order methods perform better than first order methods here is consistent with our previous conclusion that for smaller data sets, second order methods perform better than first order methods. It is clear that our second order methods converge faster than ADMM. Figure 17 shows the comparison between Method SBI and ADMM. Method SBI takes a larger number of significantly faster iterations, compared to ADMM, and hence results with shorter execution time needed to approach * .

Conclusions
In this paper, we consider a class of first and second order distributed optimization methods which utilize different versions of the communication sparsification strategies. While the framework subsumes several existing recent methods, we also introduce a novel method with unidirectional communication and give its convergence analysis. The paper provides a comprehensive empirical evaluation of various communication sparsification strategies on a HPC cluster. The tests of the algorithms without communication sparsification as well as with sparsified communications for different number of nodes [12,16] are described in this paper. The overall execution time is measured for different data sets in order to identify the most suitable methods for different setups.
The tests were performed on an MPI cluster with a usual configuration, where each cluster node contains one processor with 6 CPU cores, and the nodes are connected by an Ethernet network with speed of 10Gbps. Therefore, we do not expect variations in the behavior of the tested programs on other MPI clusters. On the other hand, execution and results may depend on the speed of the cores themselves and on the speed of the network. Given that we used a cluster with eighth-generation Core i7 cores, a performance jump can be expected if newer-generation CPUs and / or more powerful Xeon processors were used. This effect would refer to the shortening of the absolute execution time per core, but overall performance characteristics would remain the same. The scaling properties would still be present, as well as the preferences of certain methods for the specified scenarios regarding the data. The factor that can mostly affect the execution of the program is the speed of the network. In the case of clusters with higher network speeds, the general expectation is to achieve good program performance with more nodes than in our experiments. In these cases, communication saturation, which we have shown to be present in this type of algorithm implementations, could only occur with more nodes involved than in our experiments (see Fig. 3 and 4, as well as Fig. 5 and 6 and the corresponding descriptions in Sections 3.2 and 3.3 respectively, that show these results for our experiments). In other words, increasing the network speed would be a crucial factor that would increase The presented analysis also shows the expected scaling properties of the developed methods, starting from the differences in the optimal number of nodes for particular data set in consideration. The performance profile is used for the comparison between the proposed methods. It clearly identified that the first order methods perform much better with larger volumes of data, where for smaller data sets the second order methods are more suitable. For data sets with larger number of features (10 3 or more in our tests), the portions of data that the processes work on demand a significant amount of time to calculate the second order updates. If the number of samples is also larger (larger than 10 3 for our tests), it additionally burdens the execution. This is the reason why the first order methods perform better on larger data sets. The first order methods converge within a larger number of iterations, but those iterations are multiple times faster than for the second order methods. When the data set is smaller obtaining the second order information is not costly as the processes are working on small data portions. On these data sets the second order methods perform better as they converge within smaller number of iterations than the first order methods, while the second order iterations are negligibly slower than for the first order methods.
The method with bidirectional communication and decreasing communication probability (Method FBD) is identified as the best performing first order method. This method also shows the best performance globally, when observing all the tests on all 5 data sets. The fact that the bidirectional method performs better than the unidirectional method in most of the cases is a consequence of enabling exchange only between active nodes. Unidirectional methods require additional communication lines, in order to enable receiving data for idle nodes from their neighbors. The gain from solution update for the idle nodes can be slightly smaller than the cost of the communication to achieve that update. The decreasing probability enables more communication in the beginning of the execution. Later, the communication becomes sparse, but at the same time the solution becomes closer to the desired one, so that it does not require much communication any more. This is the reason why decreasing communication probability with a bidirectional method represents an optimal choice. However, the other methods with communication sparsification also showed satisfactory performance. The tests showed that, in general, communication sparsification can significantly improve performance. This serves as motivation for using communication sparsification in the described framework. It is also shown that communication sparsification does not introduce performance improvement with second order methods in general.
An important aspect of tests is the comparison between bidirectional and unidirectional communication. One conclusion is that unidirectional communication strategy works in the framework for Algorithm 1, and thus confirm the theoretical results. Besides that, this strategy yields lower execution time than the bidirectional communication strategy for some test cases. All these conclusions might be influenced by the considered data sets but nevertheless provide significant empirical evidence.
Further evaluation of unidirectional communication can be an interesting future task. Another challenging direction might be further implementation for very large data sets that cannot be held in memory.