Performance evaluation and analysis of distributed multiagent optimization algorithms with sparsified directed communication
EURASIP Journal on Advances in Signal Processing volume 2021, Article number: 25 (2021)
Abstract
There has been significant interest in distributed optimization algorithms, motivated by applications in Big Data analytics, smart grid, vehicle networks, etc. While there have been extensive theoretical advances, a proportionally small body of scientific literature focuses on numerical evaluation of the proposed methods in actual practical, parallel programming environments. This paper considers a general algorithmic framework of first and second order methods with sparsified communications and computations across worker nodes. The considered framework subsumes several existing methods. In addition, a novel method that utilizes unidirectional sparsified communications is introduced and a theoretical convergence analysis is provided. Namely, we prove R-linear convergence in the expected norm. A thorough empirical evaluation of the methods using Message Passing Interface (MPI) on a High Performance Computing (HPC) cluster is carried out, and several useful insights and guidelines on the performance of the algorithms and the inherent communication-computational tradeoffs in a realistic setting are derived.
Introduction
Distributed multiagent optimization is today a mature theoretical area, e.g. [1]. Several first order [1–3] and second order [4–6] distributed methods have been proposed, and their theoretical properties are well understood, e.g., in terms of iteration-wise convergence rates.
Distributed multiagent optimization methods have great potential in various application domains, including distributed machine learning [7], distributed control [8], vehicular networks [9], smart grid [10], etc. Relevant applications have also been demonstrated [11]. However, there is a limited amount of scientific investigation of distributed multiagent optimization methods in realistic distributed computational/High Performance Computing (HPC) systems. Carrying out such studies is extremely important, as there is a significant gap between theoretical studies of the methods and their actual performance in practical infrastructures. For example, it is very hard to understand the relationship between the iteration-wise convergence rate and the actual communication and computational costs without empirical evaluation.
In this paper, a thorough and systematic empirical study of a class of distributed multiagent optimization methods is carried out. All these methods are defined by different variants of sparsification of communications and/or computations along iterations. In more detail, we consider both first and second order methods that exhibit either unidirectional or bidirectional randomized sparsified communications. The considered sparsification strategies involve the parameter p_{k} that represents the probability that a node communicates at iteration k; the quantity p_{k} is a design parameter that is either increasing, decreasing, or constant. These strategies give rise to a number of methods summarized in Table 1 (see Footnote 1). The studied class of methods subsumes several existing algorithms [12–17] but also includes several methods that have not been studied before, neither numerically nor analytically. A further contribution to the body of analytical results is the presented convergence analysis of the FUI method.
The main aims of the empirical evaluation are as follows: 1) to assess real benefits of sparsifying communications and/or computations across nodes, which have been proved to be beneficial theoretically [16]; 2) to compare different alternatives of the sparsification strategies; and 3) to compare unidirectional and bidirectional communication strategies. One of the main motivations for using sparsified, randomized communication is to reduce the amount of time spent on data exchange. The choice of omitting to communicate in some cases can also lead to savings in bandwidth or transmission power of wireless devices, when considering wireless networks. Using randomized communication at the level of algorithm design is a well established topic, where, e.g., gossip [18] is an outstanding example. It is also of interest to explore the case when communication sparsification is not fully determined by the algorithm designer, but instead is dictated by random link failures (e.g., packet dropouts in wireless networks).
The underlying implementation framework is the MPI (Message Passing Interface, [19]) running on an HPC computer cluster, which is a standard and widely adopted computational system.
The rest of the paper is organized as follows. An overview of the work related to this topic is presented in Subsection 1.1. We briefly describe the optimization model and the algorithmic framework for the Distributed Quasi-Newton (DQN) method in Subsection 2.1. The algorithmic framework is described in Subsection 2.2. Convergence analysis of the novel FUI method that uses unidirectional communication is presented in Subsection 2.3. Implementation and infrastructure are described in Subsection 2.4. The simulation setup is described in Subsection 2.5 and the proposed set of methods that fit the introduced algorithm is presented in Subsection 2.6. The results are highlighted in Section 3, organized in the following manner: Subsection 3.1 contains an analysis of different graph types; Subsection 3.2 is dedicated to the analysis of scaling properties of the algorithm; Subsection 3.3 contains an analysis and comparison of execution time for all the considered methods; an analysis and discussion of the effects of communication sparsification is given in Subsection 3.4; and Subsection 3.5 is dedicated to the comparison of Algorithm 1 to ADMM (Alternating Direction Method of Multipliers, see [11]). Finally, the conclusions are given in Section 4.
Related work
There has been a large body of literature on theoretical development of distributed optimization methods. A proportionally much smaller body of scientific literature focuses on testing and evaluation of the methods over actual distributed infrastructures. Existing studies include, e.g., [20], for the dual averaging method, and [11] for the alternating direction method of multipliers. A stochastic, efficient quasi-Newton method, using the BFGS update formula, is introduced in [21] in order to take advantage of the curvature information. A fast distributed proximal gradient method is proposed in [22]. An incremental subgradient approach, suitable for distributed optimization in networked systems, is presented in [23]. An important aspect of evaluation in distributed optimization is the nature of the network of nodes itself. The effects of this aspect are highlighted in [24].
More recently, there have been works that include MPI-based empirical studies of the methods. In [25] an asynchronous subgradient-push method is proposed and its performance is evaluated on an MPI cluster, whereas in [26] an empirical comparison of several distributed first order methods is given. An exact asynchronous method and its performance analysis using an MPI cluster are presented in [27]. A theoretical and empirical study of communication and computational tradeoffs for the distributed dual averaging method is given in [20]. Finally, the focus of [28] is on the distributed dual averaging method, with several useful guidelines about practical design and performance of the methods.
With respect to existing studies, this paper differs along several lines. First, it considers a different class of methods with respect to existing empirical studies, as the considered methods include various strategies for communication sparsification (see [12–17]). Second, it provides novel insights into the comparison among different sparsification strategies, as well as into the practical benefits with respect to the corresponding always-communicating benchmark. The empirical results show that communication sparsification can lead to significant execution time reductions. To the best of our knowledge, this is the first empirical evaluation reported on the class of algorithms with sparsified communications presented in [16].
Also, a theoretical convergence analysis of the FUI method is carried out in this paper. While [14] also considers unidirectional communications, it studies the specific problem of distributed estimation, which translates into quadratic objective functions and stochastic gradient updates. In contrast, our analysis considers generic strongly convex costs. An important aspect of the framework considered in this paper is that it includes both first and second order methods.
Methods
Optimization and network models
Consider a connected network of n nodes, where each node i has access to a convex cost function f_{i}:ℝ^{s}→ℝ, and assume that f_{i} is known only by node i. The goal is to solve the following unconstrained optimization problem
\[\min_{x \in \mathbb{R}^{s}} f(x) := \sum_{i=1}^{n} f_{i}(x). \qquad (1)\]
With problem (1) a graph G=(N,E) can be associated, where N={1,...,n} is the set of nodes, and E is the set of edges {i,j}, i.e., pairs of nodes i and j that can directly communicate.
As it will be seen, graph G represents a collection of realizable communication links; actual algorithms that are considered here may utilize subsets of these links over iterations in possibly unidirectional, sparsified communications.
The assumption is that G is connected, undirected and simple (no self-loops or multiple links). Denote by Ω_{i} the neighborhood set of node i and associate an n×n symmetric, doubly stochastic matrix W with graph G. The matrix W has positive diagonal entries and respects the sparsity pattern of graph G, i.e., for i≠j, W_{ij}=0 if and only if {i,j}∉E. On the other hand, it is important to note that, in the case of unidirectional communication between the nodes, the graph instantiations over iterations (subgraphs of G) can be directed. Also, assume that W_{ii}>0,∀i.
It can be shown that λ_{1}(W)=1, and \(\bar {\lambda }_{2}(W)<1\), where λ_{1}(W) is the largest eigenvalue of W, and \(\bar {\lambda }_{2}(W)\) is the modulus of the eigenvalue of W that is second largest in modulus. Denote by λ_{n}(W) the smallest eigenvalue of W. There also holds |λ_{n}(W)|<1.
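The following optimization problem can be associated with (1),
\[\min_{x \in \mathbb{R}^{ns}} \Psi(x) := \sum_{i=1}^{n} f_{i}(x_{i}) + \frac{1}{2\alpha}\sum_{i<j} W_{ij}\,\|x_{i}-x_{j}\|^{2}, \qquad (2)\]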
where \(x=\left (x_{1}^{T},..., x_{n}^{T}\right)^{T} \in \mathbb{R}^{ns}\) is the optimization variable partitioned into s×1 blocks x_{1},...,x_{n}. The reasoning behind this transformation is the following. Assume that s=1 for simplicity. Under the stated assumptions on matrix W, it can be shown that Wx=x if and only if x_{1}=x_{2}=...=x_{n}, so the problem (1) is equivalent to
\[\min_{x \in \mathbb{R}^{n}} F(x) \quad \text{subject to} \quad (I-W)x=0, \qquad (3)\]
where \(F(x):= \sum _{i=1}^{n} f_{i}(x_{i})\) and I is the identity matrix. Moreover, I−W is positive semidefinite, so (I−W)x=0 is equivalent to (I−W)^{1/2}x=0. Therefore, (3) can be replaced by
\[\min_{x \in \mathbb{R}^{n}} F(x) \quad \text{subject to} \quad (I-W)^{1/2}x=0. \qquad (4)\]
In other words, the constraint Wx=x enforces that all the feasible x_{i}’s in optimization problem (3) are mutually equal, thus ensuring the equivalence of (1) and (3) and the equivalence of (1) and (4). Further, a penalty reformulation of (4) can be stated as
\[\min_{x \in \mathbb{R}^{n}} F(x) + \frac{1}{2\alpha}\left\|(I-W)^{1/2}x\right\|^{2}, \qquad (5)\]
where \(\frac {1}{\alpha }\) is the penalty parameter. Therefore, (5) represents a quadratic penalty reformulation of the original problem (1). After standard manipulations with the penalty part we obtain
\[\min_{x \in \mathbb{R}^{n}} F(x) + \frac{1}{2\alpha}\,x^{T}(I-W)x, \qquad (6)\]
which is the same as (2) for s=1. These considerations are easily generalized for s>1.
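The "standard manipulations" with the penalty part rest on the identity \(x^{T}(I-W)x=\sum_{i<j} W_{ij}(x_{i}-x_{j})^{2}\), which holds for s=1 whenever W is symmetric with rows summing to one. The following minimal sketch checks the identity numerically. The ring topology and the uniform weights 1/3 are illustrative assumptions, not the weights used in the paper, and the helper names are hypothetical.

```python
def ring_weights(n):
    """Symmetric, doubly stochastic W on a ring (n >= 3).

    Assumption for illustration: weight 1/3 on the node itself and on
    each of its two ring neighbours.
    """
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def quadratic_form(W, x):
    """Evaluate x^T (I - W) x for a scalar-per-node vector x (s = 1)."""
    n = len(x)
    return sum(
        x[i] * ((1.0 if i == j else 0.0) - W[i][j]) * x[j]
        for i in range(n)
        for j in range(n)
    )


def pairwise_penalty(W, x):
    """Evaluate sum over i < j of W_ij * (x_i - x_j)^2."""
    n = len(x)
    return sum(
        W[i][j] * (x[i] - x[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


W = ring_weights(5)
x = [1.0, -2.0, 0.5, 3.0, 4.0]
lhs = quadratic_form(W, x)
rhs = pairwise_penalty(W, x)
consensus_penalty = quadratic_form(W, [2.0] * 5)
```

The two values `lhs` and `rhs` agree up to rounding, and the penalty vanishes at a consensus vector, which is the property that makes the penalty term pull the local estimates together.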
It is well known [1] that the solutions of (1) and (2) are mutually close. More specifically, for each i=1,...,n, \(\|x_{i}^{\circ }-x^{\ast }\|=O(\alpha)\), where x^{∗} is the solution to (1) and \(x^{\bullet }=\left (\left (x_{1}^{\circ }\right)^{T},...,\left (x_{n}^{\circ }\right)^{T}\right)^{T}\) is the solution to (2). In more detail, Theorem 4 in [29] says that under strongly convex local costs f_{i}’s with Lipschitz continuous gradients (see ahead Assumption 2.1 for details), the following holds, for all i=1,...,n:
Here, \(x_{i}^{\prime }\) is the minimizer of f_{i}, L is the Lipschitz constant of the gradients of the f_{i}’s, and μ is the strong convexity constant of the f_{i}’s.
The usefulness of formulation (2) is that it offers a solution that is close (on the order O(α)) to the desired solution of (1), while, unlike formulation (1), it is readily amenable for distributed implementation. A key insight known in the literature (see, e.g. [4, 30]) is that applying a conventional (centralized) gradient descent method on (2) precisely recovers the distributed gradient method proposed in [1]. In other words, it has been shown that the distributed method in [1] – that approximately solves (1) – actually converges to the solution of (2). This insight has been significantly explored in the literature to derive several distributed methods, e.g., [4, 5, 16]. The class of methods considered in this paper also exploits this insight and therefore harnesses formulation (2) to carry out convergence analysis of the considered methods.
Algorithmic framework
The algorithmic framework is presented in this Section. The framework subsumes several existing methods [12–17], and it also includes a new method that will be analysed in this paper.
Within the considered framework, each node i in the network maintains \(x_{i}^{k} \in \mathbb{R}^{s}\), its approximate solution to (1), where k is the iteration counter. In addition, let us associate a Bernoulli random variable \(z_{i}^{k}\) to each node i, that governs its communication activity at iteration k. If \(z_{i}^{k}=1\), node i communicates; if \(z_{i}^{k}=0\), node i does not exchange messages with neighbors. When \(z_{i}^{k}=1\), node i transmits \(x_{i}^{k}\) to all its neighbours j∈Ω_{i}, and it receives \(x_{j}^{k}\) from all its active (transmitting) neighbours.
The intuition behind the introduction of quantities \(z_{i}^{k}\) is the following. It has been demonstrated (see, e.g., [12]) that distributed methods to solve (1) and (2) exhibit certain “redundancy” in terms of the utilized communications. In other words, it is not necessary to activate all communication channels at all times for the algorithm to be convergent. Moreover, communication sparsification may lead to convergence speed improvements in terms of communication cost [12]. Communication sparsification and introduction of the \(z_{i}^{k}\)’s leads to less expensive but inexact algorithmic updates. A proper design of the \(z_{i}^{k}\)’s can lead to a positive resolution of the tradeoff between inexact and less expensive updates; see, e.g., [12] for details.
Assume that the random variables \(z_{i}^{k}\) are independent both across nodes and across iterations. Denote by \(p_{k} = Prob\left (z_{i}^{k}=1\right)\), assumed equal across all nodes. The quantity p_{k} is a design parameter of the method; strategies for setting p_{k} are discussed further ahead. Intuitively, a large p_{k} corresponds to “less inexact” updates but also to lower communication savings. With the considered algorithmic framework, solution estimate update at node i is as follows:
Here, α is a positive parameter, known as the step size. The values of α differ depending on the input data (see ahead Section 2.5). Further, \(\xi _{i,j}^{k}\) is in general a function of \(z_{i}^{k}\) and \(z_{j}^{k}\) that encodes communication sparsification; and \(M_{i}^{k}\) is a local second order information-capturing matrix, i.e., the Hessian approximation.
The following choices of the quantities \(\xi _{i,j}^{k}\) and \(M_{i}^{k}\) will be considered: 1) \(\xi _{i,j}^{k}=1\): no communication sparsification; 2) \(\xi _{i,j}^{k} = z_{i}^{k} \cdot z_{j}^{k}\): bidirectional communication sparsification (that is, node i includes node j’s solution estimate in its update only if both i and j are active in terms of communications); and 3) \(\xi _{i,j}^{k} = z_{j}^{k}\): unidirectional communication sparsification (that is, node i includes node j’s solution estimate in its update whenever node j transmits, irrespective of node i being transmission-active or not).
Regarding the matrix \(M_{i}^{k}\), two options can be considered. First, \(M_{i}^{k}=I\) and this corresponds to first order methods, where one has no second order information included. Second option is \(M_{i}^{k}=D_{i}^{k}\), where:
This corresponds to the second order methods of DQNtype [16] (See ahead Section 2.6).
We now provide intuition behind the generic method (9)–(10) and the choices of \(\xi _{i,j}^{k}\)’s and \(M_{i}^{k}\)’s. The method (9)–(10) corresponds to an inexact first order or an inexact second order method to solve (2) – and hence to approximately solve (1). The main source of inexactness is the sparsification (\(\xi _{i,j}^{k}\)’s). The bidirectional communication \(\left (\xi _{i,j}^{k} = z_{i}^{k} \cdot z_{j}^{k}\right)\) is appealing as it preserves symmetry in the underlying weight matrix, which is known to be a beneficial theoretical property. On the other hand, the bidirectional sparsification is also wasteful in that a node ignores the received message from a neighbor if its own transmission to the same neighbor is not successful (see formula (9)). With respect to the choice of first versus second order method (the choice of \(M_{i}^{k}\)), the second order choice is computationally more expensive per iteration due to the Hessian computations; on the other hand, it can improve convergence speed iteration-wise.
The pseudocode for the general algorithmic framework is in Algorithm 1. A summary of all the considered methods within the framework of Algorithm 1 is given in Table 1.
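The three choices of \(\xi _{i,j}^{k}\) can be illustrated with a minimal sketch of one synchronous iteration for the first order case \(M_{i}^{k}=I\) and s=1. The update form used here, \(x_{i}^{k+1}=x_{i}^{k}-\sum_{j\in\Omega_{i}}W_{ij}\,\xi_{i,j}^{k}\,(x_{i}^{k}-x_{j}^{k})-\alpha\nabla f_{i}(x_{i}^{k})\), is an assumption: it is the gradient step on (2) with the sparsification factors inserted, and it is not a transcription of (9)–(10). The function names, the three-node complete graph and the zero gradients are hypothetical choices made so that only the mixing term is visible.

```python
def xi(mode, z_i, z_j):
    """Sparsification factor for the link from node j to node i."""
    if mode == "none":   # no sparsification
        return 1
    if mode == "bi":     # bidirectional: both endpoints must be active
        return z_i * z_j
    if mode == "uni":    # unidirectional: only the sender must be active
        return z_j
    raise ValueError("unknown mode: " + mode)


def first_order_step(x, grads, W, neighbors, z, alpha, mode):
    """One synchronous update of all nodes (assumed form, M_i^k = I)."""
    new_x = []
    for i in range(len(x)):
        mix = sum(
            W[i][j] * xi(mode, z[i], z[j]) * (x[i] - x[j])
            for j in neighbors[i]
        )
        new_x.append(x[i] - mix - alpha * grads[i](x[i]))
    return new_x


# Three fully connected nodes with uniform weights 1/3 (illustrative only).
W = [[1.0 / 3.0] * 3 for _ in range(3)]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
grads = [lambda v: 0.0] * 3   # zero gradients isolate the mixing term
x = [0.0, 3.0, 9.0]
z = [1, 0, 1]                 # node 1 is inactive at this iteration

step_none = first_order_step(x, grads, W, neighbors, z, 0.1, "none")
step_bi = first_order_step(x, grads, W, neighbors, z, 0.1, "bi")
step_uni = first_order_step(x, grads, W, neighbors, z, 0.1, "uni")
```

In this example the inactive node 1 keeps its estimate unchanged under bidirectional sparsification, while under unidirectional sparsification it still averages with the estimates received from its active neighbours. This is the "wasteful" aspect of the bidirectional variant described above.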
Convergence analysis
In this section, a convergence analysis of the algorithm variant with unidirectional communications is carried out (see ahead Method FUI in Section 2.6). More precisely, in this section we assume the following choice of \(M_{i}^{k}\) and \(\xi _{i,j}^{k}\):
\[M_{i}^{k}=I, \qquad \xi _{i,j}^{k}=z_{j}^{k}. \qquad (12)\]
To the best of our knowledge, except for a different estimation setting [14], this algorithm has not been studied before. The following assumptions are needed.
Assumption 2.1.
(a) Each function f_{i}:ℝ^{s}→ℝ, i=1,...,n, is twice differentiable, strongly convex with strong convexity modulus μ>0, and it has Lipschitz continuous gradient with the constant L, L≥μ.
(b) The graph G is undirected, connected and simple.
(c) The step size α in (2) satisfies \(\alpha < \min \left \{\frac {1}{2L},\frac {1+\lambda _{n}(W)}{L}\right \}\).
By Assumption 2.1, Ψ is strongly convex with modulus μ. Moreover, the gradient of Ψ is Lipschitz continuous with the constant
\[L_{\Psi } := L+\frac{1-\lambda _{n}(W)}{\alpha }. \qquad (13)\]
Notice that Assumption 2.1 (c) implies that α<(1+λ_{n}(W))/L, which is equivalent to
\[\alpha L_{\Psi }-1<1. \qquad (14)\]
Let \(x^{k}=\left (\left (x_{1}^{k}\right)^{T},...,\left (x_{n}^{k}\right)^{T}\right)^{T}\). We have the following convergence result for the first order method with unidirectional communications.
Theorem 2.1.
Let {x^{k}} be a sequence generated by Algorithm 1, method FUI, and let Assumption 2.1 hold. Then, the following results hold:
(a) Assume that the sequence {p_{k}} converges to one as k→∞. Then, the sequence of iterates {x^{k}} converges to x^{∙} in the expected error norm, i.e., there holds:
\[\lim_{k\to \infty } E\left [\|x^{k}-x^{\bullet }\|\right ]=0. \qquad (15)\]
(b) Assume that the sequence {p_{k}} converges to one geometrically as k→∞, i.e., p_{k}=1−δ^{k+1}, for all k and some δ∈(0,1). Then, there holds:
where γ<1 is a positive constant.
(c) Assume that p_{k}≥p_{min} for all k and for some p_{min}∈(0,1), and that the iterative sequence {x^{k}} is uniformly bounded, i.e., there exists a constant C_{1}>0 such that E[∥x^{k}∥]≤C_{1}, for all k. Then, there holds:
where \(C_{2}=\frac {2nC_{1}} {\alpha \mu }\) and θ∈(0,1).
Theorem 2.1 demonstrates that Algorithm 1 with sparsified and unidirectional communications converges. More precisely, as long as the sequence p_{k} converges to one, even arbitrarily slowly, the sequence {x^{k}} converges to the solution of (2) in the expected error norm sense. When the convergence of p_{k} to one is geometric, x^{k} converges geometrically, i.e., at a linear rate. Finally, when p_{k} does not converge to one but stays bounded below by p_{min}, under the additional assumption that the sequence {x^{k}} is uniformly bounded, the algorithm converges to a neighbourhood of the solution to (2), where the neighbourhood size is controlled by the parameter p_{min} (the closer p_{min} is to one, the smaller the error). This complements the existing results in [16], which concern bidirectional communications.
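The behaviour described in part (b) can be observed on a toy problem. The sketch below is an illustration under several assumptions that are not taken from the paper: scalar quadratic costs \(f_{i}(x)=\frac{1}{2}(x-c_{i})^{2}\), a ring of six nodes with uniform weights 1/3, α=0.1, δ=0.5, and the first order unidirectional update written as \(x_{i}^{k+1}=x_{i}^{k}-\sum_{j\in\Omega_{i}}W_{ij}z_{j}^{k}(x_{i}^{k}-x_{j}^{k})-\alpha\nabla f_{i}(x_{i}^{k})\). The always-communicating run serves as a stand-in for x^{∙}, since with p_{k}=1 the update reduces to exact gradient descent on Ψ.

```python
import random


def simulate(num_iters, prob, seed=0, n=6, alpha=0.1):
    """Run the assumed first order unidirectional update on a ring.

    prob(k) returns the communication probability p_k at iteration k.
    """
    rng = random.Random(seed)
    c = [float(i) for i in range(n)]   # f_i(x) = 0.5 * (x - c_i)^2
    x = [0.0] * n
    for k in range(num_iters):
        p_k = prob(k)
        # z_j^k = 1 means node j transmits at iteration k.
        z = [1 if rng.random() < p_k else 0 for _ in range(n)]
        new_x = []
        for i in range(n):
            nbrs = ((i - 1) % n, (i + 1) % n)
            mix = sum((1.0 / 3.0) * z[j] * (x[i] - x[j]) for j in nbrs)
            new_x.append(x[i] - mix - alpha * (x[i] - c[i]))
        x = new_x
    return x


# Reference: every node transmits at every iteration (p_k = 1).
x_ref = simulate(500, lambda k: 1.0)
# Sparsified run with p_k = 1 - delta^(k+1), delta = 0.5.
x_fui = simulate(500, lambda k: 1.0 - 0.5 ** (k + 1))
err = max(abs(a - b) for a, b in zip(x_fui, x_ref))
```

After 500 iterations the two runs agree to high accuracy, since the perturbation caused by skipped transmissions dies out geometrically together with 1−p_{k}. Replacing the schedule by a constant p_{k}<1 leaves a residual error, in line with part (c).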
Next, the proof of Theorem 2.1 will be carried out. To avoid notation clutter, let the dimension of the original problem (1) be s=1. The proof relies on the fact that the method can be written as an inexact gradient method for minimization of Ψ. More specifically, it can be shown that the algorithm determined by (9) – (12) is equivalent to the following:
where \(e^{k}=\left (e^{k}_{1},...,e_{n}^{k}\right)^{T}\) is given by
and e^{k}∈ℝ^{n}. Indeed, in view of (12), method (9)–(10) can be represented as
where
Thus,
Therefore, for each component i, the error is determined by
and (19) follows.
Next we state and prove an important result. Here and throughout the paper, ∥·∥ denotes the vector 2-norm and the corresponding matrix norm.
Lemma 2.2.
Suppose that Assumption 2.1 holds. Then for each k we have
where x^{0} is the initial iterate and θ= max{1−αμ,αL_{Ψ}−1}<1.
Proof. Using (18) and the fact that ∇Ψ(x^{∙})=0 we obtain
Further, there exists a symmetric positive definite matrix B_{k} such that
and its spectrum belongs to [μ,L_{Ψ}]. Thus, we obtain
Notice that Assumption 2.1 (c) implies that θ<1, since (14) holds and L≥μ. Moreover, putting together (26)–(28), we obtain
and applying the induction argument we obtain the desired result. \(\square \)
To complete the proof of parts (a) and (b) of Theorem 2.1, we need to derive an upper bound for ∥e^{k}∥ in the expected-norm sense. In order to do so, we need to establish the boundedness of the iterates x^{k} in the expected-norm sense.
Lemma 2.3.
Let Assumption 2.1 hold, and consider the setting of Theorem 2.1 (a). Then, there holds E[∥x^{k}∥]≤C_{x} for all k, where C_{x} is a positive constant.
Proof. The update rule (20) can be written equivalently as follows
Introduce \(\widetilde {W_{k}}=W_{k}-W\), and rewrite (30) as
Denote by x^{′} the minimizer of F. Then, by the Mean Value Theorem, there holds
and
Note that ∥H_{k}∥≤L, by Assumption 2.1. Also, note that ∥W−αH_{k}∥≤1−αμ, for \(\alpha \leq \frac {1}{2L}\). Therefore, the following can be stated
Next, \(\|\widetilde {W_{k}}\|\) will be upper bounded. Note that
Therefore,
Taking expectation and using the fact that \(E\left [z_{j}^{k}\right ]=p_{k}\), for all k, it can be concluded that
for some positive constant \(\widetilde {C}\). Now, using independence of \(\widetilde {W_{k}}\) and x^{k}, the following can be obtained from (34),
As p_{k}→1, i.e., (1−p_{k})→0, it is clear that, for sufficiently large k, there holds
This implies that there exists a constant C_{x} such that E[∥x^{k}∥]≤C_{x}, for all k=0,1,.... \(\square \)
Applying Lemma 2.3, the following result is obtained.
Lemma 2.4.
Suppose that the Assumption 2.1 holds and E(∥x^{k}∥)≤C_{1} for all k and some constant C_{1}. Then the error sequence {∥e^{k}∥} satisfies
for the constant \(C_{e}=\frac {2n}{\alpha }\left (1-p_{min}\right) C_{1}\).
Proof. The proof follows straightforwardly from (19) and Lemma 2.3. Consider (24). Then, \(e_{i}^{k}\) can be upper bounded as follows:
This yields:
Taking expectation while using independence of \(z_{j}^{k}\) and x^{k}, and using E(∥x^{k}∥)≤C_{1}; \(\sum _{j \in \Omega _{i}} W_{ij} \leq 1\); and \(E\left (1-z_{j}^{k}\right) = 1-p_{k}\), the result follows. \(\square \)
Now, Theorem 2.1 can be proved as follows.
Proof of Theorem 2.1. We first prove part (a). Taking expectation in Lemma 2.2, and using Lemma 2.4, we get
Next, applying Lemma 3.1 in [31], it follows that
as we wanted to prove.
Let us now consider the part (b). Note that, in this case, we have that 1−p_{k}=δ^{k+1}, for all k. Specializing the bound in (43) to this choice of p_{k}, the following holds
and using the fact that \(s_{k}:=\sum _{t=1}^{k} \theta ^{k-t} \delta ^{t}\) converges to zero R-linearly (see Lemma II.1 from [16]), we obtain the result.
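The R-linear decay of s_{k} can also be seen from an elementary bound: each of the k terms \(\theta ^{k-t}\delta ^{t}\) is at most \(m^{k}\) with m=max{θ,δ}, so \(s_{k}\leq k\,m^{k}\). The short sketch below checks this numerically; the values θ=0.9 and δ=0.5 are arbitrary illustrative choices, and the function name is hypothetical.

```python
def s(k, theta, delta):
    """Compute s_k = sum_{t=1}^{k} theta^(k-t) * delta^t."""
    return sum(theta ** (k - t) * delta ** t for t in range(1, k + 1))


theta, delta = 0.9, 0.5
m = max(theta, delta)

values = [s(k, theta, delta) for k in range(1, 201)]
bounds = [k * m ** k for k in range(1, 201)]
```

Since \(k\,m^{k}\leq C\rho ^{k}\) for any ρ∈(m,1) and a suitable constant C, the bound confirms convergence of s_{k} to zero at a geometric rate.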
Finally, we prove part (c). Here, we upper bound the term (1−p_{t−1}) in (43) with (1−p_{min}). For this case we obtain
which completes the proof of part (c).\(\square \)
Implementation and infrastructure
A parallel implementation of Algorithm 1 was developed, using MPI [19]. The testing was performed on the AXIOM computing facility consisting of 16 nodes (8 x Intel i7 5820k 3.3GHz and 8 x Intel i7 8700 3.2GHz CPU - 96 cores and 16GB DDR4 RAM/node) interconnected by a 10 Gbps network.
Network configurations of grid and regular graphs are taken into consideration for graph G. A set of tests is conducted for the same data set with the same number of nodes for both types of graphs: d-regular graphs and grid graphs.
The input data for the algorithm are read from binary files by the master process. The master process then scatters the data to other processes in equal pieces. If the data size is not divisible by the number of processes, then the remaining data is assigned to the master process. Therefore, the data are in the memory during computation and there is no Input/Output (I/O) operation performed while executing the algorithm.
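The scattering rule described above can be sketched as a small helper that computes how many rows each process receives. This is an illustration only: the function name is hypothetical, the master is assumed to be rank 0, and the actual MPI calls used in the implementation are not reproduced here.

```python
def partition_counts(num_rows, num_procs, master=0):
    """Rows per process: equal pieces, with the remainder given to the master."""
    base, remainder = divmod(num_rows, num_procs)
    counts = [base] * num_procs
    counts[master] += remainder
    return counts


# Example with the Conll data set size (r = 220663) on 16 processes.
conll_counts = partition_counts(220663, 16)
small_counts = partition_counts(10, 3)
```

With this rule the load imbalance is at most `num_procs - 1` rows, all of them on the master process.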
The communication between the nodes is realized by creating a set of communicators – one for each node. The ith communicator contains the ith node as the master, and the nodes that are its neighbors. When sparsifying the communication between the nodes, the communicators should be recreated across the iterations, in order to ensure that only active nodes can send their results, see [11]. When using bidirectional communications, an active node is being included into its own communicator and into the communicators of its active neighbours. An inactive node is not included in the communicators of its neighbors, and also does not need its own communicator at the current iteration. In the case of unidirectional communication, an inactive node is included in its own communicator, but not in the communicators of its neighbors.
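The membership rules for the communicators can be summarized in a few lines. The sketch below only computes which nodes belong to each communicator at a given iteration; it does not create MPI communicators, and the function name, the dictionary-based graph representation and the four-node path graph are hypothetical choices for illustration.

```python
def communicator_members(neighbors, active, mode):
    """Members of each node's communicator at one iteration.

    neighbors: dict mapping node -> list of neighbouring nodes
    active:    dict mapping node -> True if the node transmits
    mode:      "bi" (bidirectional) or "uni" (unidirectional)
    Returns a dict node -> member list (owner first), or None when the
    node needs no communicator at this iteration.
    """
    comms = {}
    for i, nbrs in neighbors.items():
        if mode == "bi" and not active[i]:
            # An inactive node neither sends nor receives.
            comms[i] = None
            continue
        # Owner plus its active neighbours; inactive neighbours are left out.
        comms[i] = [i] + [j for j in nbrs if active[j]]
    return comms


# Path graph 0 - 1 - 2 - 3 with node 1 inactive.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
active = {0: True, 1: False, 2: True, 3: True}

bi = communicator_members(neighbors, active, "bi")
uni = communicator_members(neighbors, active, "uni")
```

In the unidirectional case the inactive node 1 still owns a communicator, through which it receives from its active neighbours 0 and 2, while it is absent from the communicators of those neighbours.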
The data distribution process does not consume a large amount of the execution time. For example, considering a data set that contains a matrix of 5000×6000 elements and a vector of 5000 elements, the initial setup, including reading and scattering the data, as well as the creation of the communicators, takes about 0.3s per process. When compared to the overall runtime of the tests it represents a relatively small percentage. Regarding the case with the lowest execution time this percentage is 5%. On the other hand it is only 0.0007%, in the case with the highest execution time.
Regarding the stopping criterion, we let the algorithms run until ∥∇Ψ(x^{k})∥≤ε, where ε=0.01. Note that the gradient ∇Ψ(x^{k}) is not computable by any single node in a distributed graph G in general. In our implementation, ∇Ψ(x^{k}) is maintained by the master node. While this is not a realistic stopping criterion in a fully distributed setting, it allows us to adequately compare different algorithmic strategies.
The implementation relies on efficient LAPACK [32] and BLAS [33] linear algebra operations, applied on the nodes, while performing local calculations.
Simulation setup
The tests were performed on two types of graphs: d-regular and grid graphs with different numbers of nodes. We constructed the d-regular graphs in the following way. For 8-regular graphs, for each number of nodes n, we construct an 8-regular graph starting from a ring graph with nodes {1,2,...,n} and then adding to each node i the links to the nodes i−4, i−3, i−2 and i+2, i+3, i+4, where the subtractions and additions here are modulo n. The same principle was also used for the 4-regular and 16-regular graphs used in this paper.
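The construction above amounts to connecting each node to its d/2 nearest nodes on either side of a ring. A minimal sketch follows, with a hypothetical function name and with nodes indexed from 0 rather than from 1 for convenience.

```python
def circulant_regular_graph(n, d):
    """Neighbour lists of a d-regular graph on n nodes (0-indexed).

    Node i is linked to i-d/2, ..., i-1, i+1, ..., i+d/2 (modulo n),
    i.e. a ring (offsets -1 and +1) plus the additional offsets.
    """
    if d % 2 != 0 or d >= n:
        raise ValueError("d must be even and smaller than n")
    half = d // 2
    return {
        i: sorted({(i + off) % n for off in range(-half, half + 1) if off != 0})
        for i in range(n)
    }


graph = circulant_regular_graph(16, 8)
```

For `n = 16` and `d = 8`, node 0 is linked to nodes 1 to 4 and 12 to 15, and every node has exactly eight neighbours.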
The tests are performed for the logistic loss functions given by
Here, \( x = \left (x_{1}^{T}, x_{0}\right) \in \mathbb {R}^{s-1} \times \mathbb {R} \) represents the optimization variable and τ is the penalty parameter. The input values are \(a_{i} \in \mathbb {R}^{s-1} \) and \(b_{i} \in \mathbb {R}. \)
The testing is performed on different versions of Algorithm 1 with sparsified communication, for both bidirectional and unidirectional communication strategies (see ahead Table 1).
The input data are represented as an r×(s−1) sized matrix of features, and an r sized vector of labels. Both the matrix and the vector are then divided into n parts corresponding to the nodes as explained in the previous section. We then vary n (and the corresponding graph G) and investigate the performance of Algorithm 1.
The following data sets were used for testing.

The Conll data set [34, 35], that concerns language-independent named entity recognition. It has r=220663 and s=20 as the input data sizes. This data set is only used for comparing the performance of the algorithm between regular and grid graphs.

The Gisette data set [36–38], known as a handwritten digit recognition problem. Its input data sizes are r=6000 and s=5001. The data set is used for testing the different alternatives of the algorithm as well as for determining the most suitable value of d for dregular graphs.

The YearPredictionMSD train data set is used to predict the release year of a song from audio features [37, 39, 40]. Here r and s are r=463715 and s=91. The data set is also used for testing the different alternatives of the algorithm.

The MNIST data set represents a database of handwritten digits [41, 42], with input data sizes r=60000 and s=785. This data set is also used for testing the different alternatives of the algorithm.

The Relative location of CT slices on axial axis data set (referred to as CT data set further on), containing features extracted from CT images [37, 43, 44]. The data sizes are r=53500 and s=386. This data set is also used for testing the different alternatives of the algorithm.

The p53 Mutants data set [37, 45–48] (referred to as p53 data set further on) is used for modelling mutant p53 transcriptional activity (active or inactive) based on data extracted from biophysical simulations. The data set sizes are r=31159 and s=5410. The data set is also used for testing the different alternatives of the algorithm.
The parameters for Algorithm 1 are set according to the experimentally obtained conclusions. The value α can be defined as \(\alpha =\frac {1}{KL}\), where L is the Lipschitz gradient constant and K∈[10,100], as proposed in [5]. The value of α can be fine-tuned according to the data set used for the tests. Increasing this value can lead to faster convergence. However, if the value is too large, then the algorithm might converge to a coarse solution neighbourhood. The values of α used for the six data sets listed above are obtained experimentally and are listed below:

α=0.0001 for the Gisette data set;

α=0.001 for the p53 data set;

α=0.1 for the YearPredictionMSD, MNIST, Conll and CT data sets.
A larger value of α=0.1 can be applied in the cases of relatively small number of features, compared to the number of instances (i.e. rows of data). Here, in all the 4 cases for α=0.1, the number of features is smaller than 1000.
The probability of communication p_{k} is set as follows: p_{k}=1−0.5^{k}, where k is the iteration counter, or as p_{k}=(k+1)^{−1}. In other words, we consider an increasing and a decreasing sequence for the p_{k}’s. The decreasing sequence for the probability is of interest for analysis, as it gradually reduces the communication time over the iterations. This might require more iterations, as the communication links are sparser. The increasing sequence for the probability may, on the other hand, require fewer iterations, but those iterations become increasingly more time consuming as the number of active communication links increases. It is of interest to investigate both possibilities.
The local second order information-capturing matrix \(M_{i}^{k}\) can be included in the computation as \(M_{i}^{k}=D_{i}^{k}\), where \(D_{i}^{k}\) is defined as in (11), or it can be replaced by an identity matrix \(M_{i}^{k}=I\). Both possibilities are of interest for testing, as it is of interest to establish empirically if the additional computation required to solve the system of linear equations in (9) pays off. With \( M_{i}^{k} = I \) we perform a (probably larger) number of cheaper iterations.
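The two schedules can be written down directly. In the sketch below the function names are hypothetical and the iteration counter k is assumed to start at 0; the source does not state the starting index, and the increasing schedule gives p_{0}=0 under this assumption.

```python
def p_increasing(k):
    """Increasing schedule p_k = 1 - 0.5^k."""
    return 1.0 - 0.5 ** k


def p_decreasing(k):
    """Decreasing schedule p_k = 1 / (k + 1)."""
    return 1.0 / (k + 1)


def expected_active_nodes(n, p_k):
    """Expected number of transmitting nodes when each is active w.p. p_k."""
    return n * p_k


inc = [p_increasing(k) for k in range(5)]
dec = [p_decreasing(k) for k in range(5)]
```

For example, with n=16 nodes the expected number of transmitting nodes at k=3 is 14 under the increasing schedule and 4 under the decreasing one, which illustrates how quickly the two strategies diverge in communication load.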
Description of the methods
Table 1 lists the different methods as alternatives of Algorithm 1, considering the solution update defined in (9), (10) and (11). The naming convention for the methods was already described in the introductory section.
Method SBC represents the initial version of the algorithm, used as the benchmark here, while Method FBC is its first order equivalent. These methods do not utilize any communication sparsification.
Note that Methods FBI, FBD, FUI, FUD, SBI, SBD, SUI, SUD use sparsification with either increasing or decreasing communication probabilities p_{k}. The choice of a linearly increasing p_{k} and a sublinearly decreasing p_{k} follows insights available in the literature; see, e.g., [16], [14]. While it is possible to consider other choices and fine-tuning of the sequence p_{k}, this topic is outside of the paper’s scope. Our primary aim is to investigate the feasibility and performance of increasing and decreasing sequences of p_{k}’s relative to the always-communicating strategy (Method SBC and Method FBC), as well as to compare unidirectional with bidirectional communication, and first order with second order methods.
The convergence analysis for the novel method with unidirectional communication, Method FUI, is presented here, while Methods SUI and SUD, which also rely on unidirectional communication, remain open for theoretical analysis. Methods FBI, FBD, SBI and SBD, which use bidirectional communication, are already analysed in the literature (see [12–17]).
The listed methods and data sets described before are used to derive some empirical conclusions. As expected, the analysis of obtained results provides some insights about the optimal number of nodes for different setups. Also, the advantages of particular methods are clearly visible and one can estimate the usefulness of sparsification based on these results, keeping in mind that the tests might be influenced by the selection of data sets. Nevertheless, we believe that the obtained insights are useful.
Results and Discussion
We now present the experimental results. First, we investigate the behaviour of Algorithm 1 for two types of graphs: d-regular graphs and grid graphs. After that, we perform a sequence of tests using all the methods and the data sets stated above on d-regular graphs. These tests are used to gain insight into the effectiveness of different sparsification alternatives and the differences between the first and second order methods in the framework of Algorithm 1.
Analysis of different graph types
The tests on different graph types are performed using the data sets Conll and CT with Method SBC. Figures 1 and 2 represent a performance comparison between the executions of the algorithm using different d-regular and grid graphs with Method SBC on the CT and Conll data sets, respectively.
Observing Fig. 1, it can be concluded that d-regular graphs perform better than grid graphs, which becomes more evident with an increasing number of nodes. However, d-regular graphs perform similarly on this data set for different values of d. The execution times for d=4 and d=8 are almost the same here. Therefore, it is important to examine the performance for different graphs on another data set. From Fig. 2, it is evident that the execution time decreases until the optimal number of nodes is reached, and starts to grow after that point. The same trend is present in Fig. 1, but the optimal number of nodes is higher here. Figure 2 clearly shows the difference between d-regular and grid graphs. It also identifies 8-regular graphs as the most suitable choice for different numbers of nodes. Therefore, in the rest of the paper we consider 8-regular graphs, based on the derived empirical conclusions. For the cases where the number of nodes n is smaller than 8, the value d=n−1 is used, leading to all-to-all graphs for n<8.
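The rule for choosing the degree can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the text does not specify how the d-regular graphs are generated, so a circulant construction (node i is linked to the nodes i±1, …, i±d/2 modulo n) is assumed here as one simple way to obtain a d-regular graph for even d. The function name `regular_graph` is hypothetical, and the fallback d=n−1 is applied whenever n≤d, which also covers n=8 and n=9, where no simple 8-regular circulant graph of this form exists.

```python
def regular_graph(n, d_target=8):
    """Return adjacency lists of a d-regular graph with d = min(d_target, n - 1)."""
    d = min(d_target, n - 1)
    if d == n - 1:
        # Too few nodes for the target degree: use the all-to-all graph.
        return {i: [j for j in range(n) if j != i] for i in range(n)}
    if d % 2 != 0:
        raise ValueError("the circulant construction assumed here needs an even d")
    half = d // 2
    graph = {}
    for i in range(n):
        # Link node i to its `half` nearest neighbours on each side of a ring.
        neighbours = {(i + s) % n for s in range(1, half + 1)}
        neighbours |= {(i - s) % n for s in range(1, half + 1)}
        graph[i] = sorted(neighbours)
    return graph


small = regular_graph(5)    # n < 8, so every node is linked to all others
large = regular_graph(20)   # 8-regular graph on 20 nodes
```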
Analysis of scaling properties
A sequence of tests with different numbers of computational nodes n is performed next to give an insight into the most suitable number of nodes for the data sets. Figures 3 and 4 represent examples of the scaling properties of the algorithm, for Method FBI on the YearPredictionMSD data set and for Method FUI on the MNIST data set, respectively. Here, when varying n, we keep the 8-regular graph structure. The optimal number of nodes can be identified in both cases. These graphs show the expected trend: the execution time decreases until the optimal number of nodes is reached, while a further increase in the number of nodes leads to longer execution times. Intuitively, a larger number of workers n means that the same overall workload is parallelized over more workers, leading to time reduction. However, the beneficial effect is lost for sufficiently large n, when the communication overhead starts to dominate. Interestingly, the optimal number of nodes is mostly constant for the first order methods as well as for the second order methods, irrespective of the data set.
Table 2 shows the percentages of successful tests for all methods, i.e., of tests that satisfy the stopping criterion ∥Φ(x^{k})∥<0.01 within the maximal execution time of 15 hours. In the failed tests the iterations also approach the solution, but they do not reach it within the given time limit. The results indicate that the first order methods are a better choice in this environment, as Method SUD is the one with the smallest number of successful tests. This can be explained by the fact that the method computes the expensive second order direction while the communication probability decreases and the communications are unidirectional. All this leads to a lack of the communication epochs needed to ensure convergence during the time consuming iterations.
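The intuition behind the optimal number of nodes can be illustrated with a toy cost model. The model below is not taken from the paper: it assumes, purely for illustration, that the computation time is a fixed workload divided by n and that the communication overhead grows linearly with n. The names `toy_time` and `best_nodes` and the constants are hypothetical.

```python
def toy_time(n, work, comm):
    """Toy model: parallel computation time work / n plus overhead comm * n."""
    return work / n + comm * n


def best_nodes(work, comm, candidates):
    """Return the candidate number of nodes with the smallest modelled time."""
    return min(candidates, key=lambda n: toy_time(n, work, comm))


# Hypothetical constants, chosen only to show the U-shaped trend.
n_opt = best_nodes(work=64.0, comm=1.0, candidates=range(1, 21))
```

In this toy model the minimum sits near the square root of work/comm, so a smaller communication cost per node shifts the optimum towards larger n.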
Analysis and comparison of execution time for the introduced methods
Table 3 lists the execution time for each of the 10 methods, for the p53 data set and 20 nodes. The maximal execution time, i.e. the time for the slowest process, is taken into account in all cases. As this time can vary across processes, all processes wait for the slowest one in the communicator in order to successfully exchange the data. All first order methods yield a significant reduction of the execution time. In this case, Method FBD has the best performance. When comparing Method FBC to Method SBC, it is clear that the computation of the second order direction \( d_{i}^{k} \) significantly increases the execution time. Reducing the amount of communication across the iterations with Method FBD leads to even faster execution here. However, this behaviour may be highly dependent on the nature of the data set. The algorithms for the p53 data set converge fast, within a relatively small number of iterations. An equally important observation is that Method FUD, using unidirectional communication and decreasing communication probability, performs better than Method FBI, with bidirectional communication and increasing communication probability. The execution times for the second order methods suggest that introducing communication sparsification mostly does not pay off, as the computation of the second order direction is time consuming.
As the nature of the data can highly influence the results, let us consider another example. Table 4 also contains the execution time for each of the 10 algorithms with 12 nodes for the MNIST data set (Method SUD does not converge for the given execution time limit). The behaviour of this data set differs from the p53 data set, observed in Table 3. For example, for 12 nodes Method FBD requires 4795 iterations to converge for the MNIST data set. When considering the p53 data set for the same setup with 12 nodes, it converges after only 3 iterations. However, the conclusions based on Table 4 are very similar to those from Table 3. In fact, it seems that the properties of particular methods are similar as long as the data sets are of similar volume.
Figures 5 and 6 represent the execution times for first order methods with communication sparsification, i.e. Methods FBI, FBD, FUI, FUD for the CT and Gisette data set, respectively. From Fig. 5, it can be concluded that the optimal number of nodes for Methods FBI, FUI and FUD, is the same value n=6. However, Method FBD performs differently. It shows lower execution time values generally, and its optimal number of nodes is n=10. Similar conclusions could be made based on Fig. 6. Here, the optimal number of nodes for Methods FBI, FUI and FUD is again the same, n=8. Method FBD also performs differently here, with lower execution time values, compared to other first order methods. The optimal number of nodes for the second order methods tends to be a larger number. This is a direct consequence of the fact that the time consuming computations for the direction are faster with smaller portions of data on a node.
Analysis of the effects of communication sparsification
Figure 7 represents the average cost reduction for different numbers of nodes, compared to the weakest performing method for each data set, where the average is taken across different data sets. These tests were performed for first order methods with communication sparsification, i.e., Methods FBI, FBD, FUI and FUD. For each data set, we divide the execution time for a given number of nodes by the worst execution time on the same data set, and compute the average value over all the data sets, for each method and each number of nodes. The conclusions based on this figure are consistent with the ones in Figs. 5 and 6. Method FBD has the best performance properties. Also, for each method, an optimal number of nodes can be identified.
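The normalization described above can be sketched in Python as follows. This is one possible reading of the procedure, given here as an assumption: for each data set the worst time is taken over all methods and node counts, each time is divided by it, and the ratios are averaged over the data sets for every (method, number of nodes) pair. The function name, the data set labels and all numbers are hypothetical and are not measurements from the paper.

```python
def normalized_cost(times):
    """times[data_set][method][n_nodes] -> execution time in seconds."""
    sums, counts = {}, {}
    for per_method in times.values():
        # Worst execution time observed on this data set.
        worst = max(t for per_n in per_method.values() for t in per_n.values())
        for method, per_n in per_method.items():
            for n_nodes, t in per_n.items():
                key = (method, n_nodes)
                sums[key] = sums.get(key, 0.0) + t / worst
                counts[key] = counts.get(key, 0) + 1
    # Average of the normalized times across data sets.
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical timings, for illustration only.
toy_times = {
    "A": {"FBD": {4: 10.0, 8: 5.0}, "FBI": {4: 20.0, 8: 10.0}},
    "B": {"FBD": {4: 30.0, 8: 15.0}, "FBI": {4: 60.0, 8: 60.0}},
}
avg = normalized_cost(toy_times)
```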
An evaluation of the algorithm execution with different sequences {p_{k}} that stay bounded away from one as k grows large is presented in Fig. 8. The unidirectional, first order method was tested on the Conll data set, using α=0.1. We observed the value of Ψ as in (2) during the execution of the algorithm. The value of Ψ decreases over time for all choices of p_{k}, as expected. The zoomed part of the figure is included in order to present the last few seconds of the execution before reaching the minimal values of Ψ. Figure 8 shows that for different values of p_{k} the iterative sequences do not converge to the same value, but also that for the constant p_{k} choices the obtained limiting values are close.
Figures 9–16 display the performance profiles [49] for the described methods. Performance profiles enable evaluating the performance of different solvers running on a large number of tests. We consider the execution time as the comparison criterion. To compute the performance profile, let us denote the execution time for a method M_{i} and test problem j by \(T_{i}^{j}\). Then, given the value β≥1 on the x-axis, the method M_{i} obtains a point for the performance on test j if \(T_{i}^{j}\leq \beta T^{j}_{min}\), where \(T^{j}_{min}\) is the smallest execution time of all tested methods on that problem, i.e., \(T_{min}^{j}=\min _{i} T_{i}^{j}\). The performance profile of the method M_{i} for a given β is then calculated as the number of points divided by the number of performed tests. For example, the value on the y-axis at β=1 gives the statistical probability that the method is the best one among all the tested methods in terms of the execution time. The value range on the x-axis is large in these figures. This is due to the very large differences in execution times, ranging from a few seconds to values larger than 18000 seconds.
Figure 9 shows the performance profile for all the tests on all data sets for the 10 methods, while Figs. 10 and 11 display the performance profiles for first and second order methods, respectively. Figures 9 and 10 identify Method FBD as the best choice within the framework of Algorithm 1. Observing the methods without sparsification, i.e. Methods SBC and FBC, Fig. 9 indicates that the first order method, Method FBC, performs better than the second order method, Method SBC. The same is true if we consider the methods with sparsification. Considering methods with decreasing communication probability and bidirectional communication, Method FBD performs clearly better than Method SBD. When comparing the other first and second order methods using the same sparsification (Method FBI and Method SBI, Method FUI and Method SUI, Method FUD and Method SUD), the first order methods perform better in 61% of the test cases. Also, the percentage of tests that converge is higher for the first order methods (see Table 2). It can also be concluded that the sparsification of second order methods gives no advantage, probably because the computation of the second order direction is time consuming. Furthermore, with communication sparsification the second order information is incorporated only partially and hence it does not provide enough advantage to compensate for the computational load. On the other hand, communication sparsification can be beneficial for the first order methods, as evidenced by Method FBD. Generally, the best performing method is a first order method using the appropriate sparsification (bidirectional with decreasing communication probability), Method FBD.
Figure 12 represents the performance profile for the tests on the Gisette data set. Here, Method FBD can also be identified as the most suitable, followed by Method FBC, and then by Method FUI, Method FBI and Method FUD, while the second order methods show poorer performance profiles. The dimension for this data set is large, s=5001, resulting in time consuming calculations in the second order methods, as the Hessian approximation matrices are of large dimensions. Therefore, the first order methods perform better than the second order methods. Figure 14 displays the performance profile for the tests on the p53 data set. The conclusions for this data set are very similar to those for Fig. 12. The dimension s is also large here, s=5410, so the first order methods again perform better than the second order methods, and Method FBD again represents the best choice. Similar conclusions emerge from Fig. 13, which represents the performance profile for the MNIST data set. The dimension s=785 is around 6 times smaller here, compared to the Gisette and p53 data sets, but the dimension r=60000 is 10 times larger than for Gisette, and 2 times larger than for p53. This results in a similar load when distributing the data, and the calculation of the second order direction is again too costly.
The performance profile for the CT data set is displayed in Fig. 15. Here, the second order method Method SUI dominates, as the data set dimension s=386 enables faster calculations of the second order direction. Comparison between the first and second order methods with the same communication sparsification yields the following conclusion: with the increasing communication probability the second order methods (Methods SBI and SUI) perform better (for both unidirectional and bidirectional communication). With the decreasing communication probability the first order methods (Methods FBD and FUD) give better results.
Figure 16 represents the performance profile for the YearPredictionMSD data set. Here, the dimension s=91 is the smallest among the observed data sets. Therefore the second order methods perform better. However, the sparsification improves neither the first order nor the second order methods for these data. This might be explained by the large dimension r=463715, due to which each node gets a large subset. Sparsifying the communication means ignoring a large portion of data on idle nodes, even if there is only one idle node. Thus, the gradient and Hessian are poorly approximated with idling.
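The definition of the performance profile translates into a few lines of Python. The sketch below follows the formula stated above; representing a test that did not finish by an infinite time is an assumption made here, and the method names and timings are hypothetical.

```python
import math


def performance_profile(times, betas):
    """times[method][j]: execution time of `method` on test j (math.inf if failed).

    Returns, for each method, the fraction of tests with T_i^j <= beta * T_min^j
    for every beta in `betas`.
    """
    methods = list(times)
    n_tests = len(times[methods[0]])
    # Smallest execution time over all methods for each test problem.
    t_min = [min(times[m][j] for m in methods) for j in range(n_tests)]
    profile = {}
    for m in methods:
        profile[m] = [
            sum(1 for j in range(n_tests) if times[m][j] <= beta * t_min[j]) / n_tests
            for beta in betas
        ]
    return profile


# Hypothetical timings in seconds; M2 fails on the third test.
toy = {"M1": [1.0, 4.0, 10.0], "M2": [2.0, 2.0, math.inf]}
prof = performance_profile(toy, betas=[1.0, 2.0])
```

In this toy example M1 is the fastest method on two of the three tests, so its profile at β=1 equals 2/3, and it reaches 1 at β=2 because it is never more than twice as slow as the best method.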
Comparison of Algorithm 1 to ADMM
As problem (1) can be solved using the Alternating Direction Method of Multipliers (ADMM) [11], we compared Algorithm 1 to an ADMM implementation for logistic regression [50], on the Conll data set. More precisely, the method in [11] solves problem (1) assuming the presence of a central node that communicates with all other nodes in the network. Hence, we adapt our algorithmic framework to the latter setting by letting the underlying network G be fully connected and by setting all entries of the matrix W equal to 1/n. The comparison between the second order Methods SBC and SBI and ADMM is shown in Table 5. We calculate the value of \(\Phi ^{k}=\frac {1}{n}\sum _{i=1}^{n} f\left (x_{i}^{k}\right)\), i.e., the global cost in (1) averaged across all nodes’ estimates, at the end of each iteration, and we also measure the execution time. The second column in Table 5 represents the time required to satisfy the condition \(\frac {\Phi ^{k}-f^{*}}{f^{*}} < 0.1\). Here, f^{*} is numerically evaluated by ADMM. The rationale for this comparison is the following. All the methods converge to a neighborhood of the solution to (1), while ADMM converges to the exact solution of (1). Therefore, it is meaningful to compare the times that each method needs to reach a certain accuracy level, measured with respect to the cost function in (1). We tested all the methods and finally included the results for the best performing second order methods, i.e. Methods SBC and SBI. More precisely, Method SBI (a second order method with sparsification) is here the best performing method across all methods, while Method SBC is taken as the baseline (second order) method without sparsification. The fact that second order methods perform better than first order methods here is consistent with our previous conclusion that for smaller data sets, second order methods perform better than first order methods. It is clear that our second order methods converge faster than ADMM.
Figure 17 shows the comparison between Method SBI and ADMM. Method SBI takes a larger number of significantly faster iterations, compared to ADMM, and hence results in a shorter execution time needed to approach Φ^{∗}.
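The comparison criterion can be sketched in Python as follows. The sketch assumes that a run is logged as a list of (elapsed time, Φ^{k}) pairs and that the reference value f^{*} is positive; the function names and the logged values are hypothetical. It also builds the matrix W with all entries equal to 1/n used for the fully connected setting.

```python
def averaging_matrix(n):
    """Weight matrix W of the fully connected setting: all entries equal 1/n."""
    return [[1.0 / n] * n for _ in range(n)]


def time_to_accuracy(history, f_star, tol=0.1):
    """Return the first elapsed time at which (phi - f_star) / f_star < tol.

    history: list of (elapsed_seconds, phi_k) pairs, one per iteration.
    Returns None if the accuracy level is never reached.
    """
    if f_star <= 0:
        raise ValueError("this sketch assumes a positive reference value f_star")
    for elapsed, phi in history:
        if (phi - f_star) / f_star < tol:
            return elapsed
    return None


# Hypothetical log of a run, for illustration only.
log = [(1.0, 2.0), (2.0, 1.2), (3.0, 1.05)]
t_hit = time_to_accuracy(log, f_star=1.0)
W = averaging_matrix(4)
```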
Conclusions
In this paper, we consider a class of first and second order distributed optimization methods which utilize different versions of the communication sparsification strategies. While the framework subsumes several existing recent methods, we also introduce a novel method with unidirectional communication and give its convergence analysis.
The paper provides a comprehensive empirical evaluation of various communication sparsification strategies on an HPC cluster. The tests of the algorithms without communication sparsification as well as with sparsified communications for different numbers of nodes [12, 16] are described in this paper. The overall execution time is measured for different data sets in order to identify the most suitable methods for different setups.
The tests were performed on an MPI cluster with a usual configuration, where each cluster node contains one processor with 6 CPU cores, and the nodes are connected by an Ethernet network with a speed of 10 Gbps. Therefore, we do not expect variations in the behavior of the tested programs on other MPI clusters. On the other hand, execution and results may depend on the speed of the cores themselves and on the speed of the network. Given that we used a cluster with eighth-generation Core i7 cores, a performance jump can be expected if newer-generation CPUs and/or more powerful Xeon processors were used. This effect would shorten the absolute execution time per core, but the overall performance characteristics would remain the same. The scaling properties would still be present, as well as the preferences of certain methods for the specified scenarios regarding the data. The factor that can affect the execution of the program the most is the speed of the network. In the case of clusters with higher network speeds, the general expectation is to achieve good program performance with more nodes than in our experiments. In these cases, communication saturation, which we have shown to be present in this type of algorithm implementation, could only occur with more nodes involved than in our experiments (see Figs. 3 and 4, as well as Figs. 5 and 6 and the corresponding descriptions in Sections 3.2 and 3.3 respectively, that show these results for our experiments). In other words, increasing the network speed would be a crucial factor that would increase the number of nodes on which the proposed implementations can be executed efficiently. This could result in different values for the optimal number of cores in different setups, compared to the results in Figs. 3–6.
The presented analysis also shows the expected scaling properties of the developed methods, starting from the differences in the optimal number of nodes for particular data set in consideration. The performance profile is used for the comparison between the proposed methods. It clearly identified that the first order methods perform much better with larger volumes of data, where for smaller data sets the second order methods are more suitable. For data sets with larger number of features (10^{3} or more in our tests), the portions of data that the processes work on demand a significant amount of time to calculate the second order updates. If the number of samples is also larger (larger than 10^{3} for our tests), it additionally burdens the execution. This is the reason why the first order methods perform better on larger data sets. The first order methods converge within a larger number of iterations, but those iterations are multiple times faster than for the second order methods. When the data set is smaller obtaining the second order information is not costly as the processes are working on small data portions. On these data sets the second order methods perform better as they converge within smaller number of iterations than the first order methods, while the second order iterations are negligibly slower than for the first order methods.
The method with bidirectional communication and decreasing communication probability (Method FBD) is identified as the best performing first order method. This method also shows the best performance globally, when observing all the tests on all 5 data sets. The fact that the bidirectional method performs better than the unidirectional method in most of the cases is a consequence of enabling exchange only between active nodes. Unidirectional methods require additional communication lines, in order to enable receiving data for idle nodes from their neighbors. The gain from solution update for the idle nodes can be slightly smaller than the cost of the communication to achieve that update. The decreasing probability enables more communication in the beginning of the execution. Later, the communication becomes sparse, but at the same time the solution becomes closer to the desired one, so that it does not require much communication any more. This is the reason why decreasing communication probability with a bidirectional method represents an optimal choice. However, the other methods with communication sparsification also showed satisfactory performance. The tests showed that, in general, communication sparsification can significantly improve performance. This serves as motivation for using communication sparsification in the described framework. It is also shown that communication sparsification does not introduce performance improvement with second order methods in general.
An important aspect of the tests is the comparison between bidirectional and unidirectional communication. One conclusion is that the unidirectional communication strategy works in the framework of Algorithm 1, which confirms the theoretical results. Besides that, this strategy yields lower execution time than the bidirectional communication strategy for some test cases. All these conclusions might be influenced by the considered data sets but nevertheless provide significant empirical evidence.
Further evaluation of unidirectional communication can be an interesting future task. Another challenging direction might be further implementation for very large data sets that cannot be held in memory.
Availability of data and materials
The data sets used during the current study are available in:
• the UCI Machine Learning repository, [http://archive.ics.uci.edu/ml] [37] (Gisette [36, 38], YearPredictionMSD [39, 40], CT [43, 44] and p53 [45–48] data sets)
• the Language-Independent Named Entity Recognition II web site [https://www.clips.uantwerpen.be/conll2003/ner/] [34, 35] (the Conll data set)
• the MNIST DATABASE of Handwritten Digits web site [http://yann.lecun.com/exdb/mnist/] [41, 42] (the MNIST data set).
Declarations
Notes
 1.
Our convention for abbreviating the methods uses a three letter system, where the first letter represents whether the method is first or second order (F or S); the second letter represents the type of the communication (B for bidirectional and U for unidirectional); the third letter represents the communication sparsification type, i.e. the probability used for communication (I for increasing, D for decreasing and C for constant)
Abbreviations
MPI: Message Passing Interface
HPC: High Performance Computing
DQN method: Distributed Quasi-Newton method
I/O: Input/Output
ADMM: Alternating Direction Method of Multipliers
References
 1
A. Nedic, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control. 54(1), 48–61 (2009). https://doi.org/10.1109/tac.2008.2009515.
 2
S. S. Ram, A. Nedich, V. V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010). https://doi.org/10.1007/s10957-010-9737-7.
 3
D. Jakovetic, J. M. F. Xavier, J. M. F. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control. 59(5), 1131–1146 (2014). https://doi.org/10.1109/tac.2014.2298712.
 4
A. Mokhtari, Q. Ling, A. Ribeiro, Network Newton distributed optimization methods. IEEE Trans. Signal Process. 65(1), 146–161 (2017). https://doi.org/10.1109/tsp.2016.2617829.
 5
D. Bajović, D. Jakovetić, N. Krejić, N. Krklec Jerinkić, Newton-like method with diagonal correction for distributed optimization. SIAM J. Optim. 27(2), 1171–1203 (2017). https://doi.org/10.1137/15m1038049.
 6
A. Mokhtari, Q. Ling, A. Ribeiro, Network Newton-part II: Convergence rate and implementation. arXiv: Optimization and Control (2015). arXiv preprint arXiv:1504.06020.
 7
K. Zhang, Z. Yang, H. Liu, T. Zhang, T. Basar, in Proceedings of the 35th International Conference on Machine Learning, vol. 80, ed. by J. Dy, A. Krause. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents (PMLR, 2018), pp. 5872–5881.
 8
J. Shamma, Cooperative Control of Distributed Multi-Agent Systems (Wiley-Interscience, USA, 2008). https://doi.org/10.1002/9780470724200.
 9
A. Salkham, R. Cunningham, A. Garg, V. Cahill, in 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2. A collaborative reinforcement learning approach to urban traffic control optimization (IEEE, Sydney, NSW, Australia, 2008), pp. 560–566. https://doi.org/10.1109/WIIAT.2008.88.
 10
R. Roche, B. Blunier, A. Miraoui, V. Hilaire, A. Koukam, in IECON 2010 - 36th Annual Conference on IEEE Industrial Electronics Society. Multi-agent systems for grid energy management: A short review (IEEE, Glendale, 2010), pp. 3341–3346. https://doi.org/10.1109/IECON.2010.5675295.
 11
S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016.
 12
D. Jakovetić, D. Bajović, N. Krejić, N. Krklec Jerinkić, Distributed gradient methods with variable number of working nodes. IEEE Trans. Signal Process. 64(15), 4080–4095 (2016). https://doi.org/10.1109/TSP.2016.2560133.
 13
A. K. Sahu, D. Jakovetic, D. Bajovic, S. Kar, Communication-efficient distributed strongly convex stochastic optimization: Non-asymptotic rates (2018). http://arxiv.org/abs/arXiv:1809.02920.
 14
A. K. Sahu, D. Jakovetic, D. Bajovic, S. Kar, in IEEE EUROCON 2019 - 18th International Conference on Smart Technologies. Communication Efficient Distributed Estimation Over Directed Random Graphs (IEEE, Novi Sad, 2019), pp. 1–5. https://doi.org/10.1109/EUROCON.2019.8861544.
 15
D. Jakovetić, D. Bajović, A. K. Sahu, S. Kar, in 2018 IEEE Conference on Decision and Control (CDC). Convergence Rates for Distributed Stochastic Optimization Over Random Networks (IEEE, Miami Beach, 2018), pp. 4238–4245. https://doi.org/10.1109/CDC.2018.8619228.
 16
N. Krklec Jerinkić, D. Jakovetić, N. Krejić, D. Bajović, Distributed Second-Order Methods With Increasing Number of Working Nodes. IEEE Trans. Autom. Control. 65(2), 846–853 (2020). https://doi.org/10.1109/tac.2019.2922191.
 17
A. Sahu, D. Jakovetić, D. Bajović, S. Kar, in 2018 IEEE Conference on Decision and Control (CDC). Distributed Zeroth Order Optimization Over Random Networks: A Kiefer-Wolfowitz Stochastic Approximation Approach (IEEE, Miami Beach, 2018), pp. 4951–4958. https://doi.org/10.1109/cdc.2018.8619044.
 18
S. Boyd, A. Ghosh, B. Prabhakar, D. Shah, Randomized gossip algorithms. IEEE/ACM Trans. Netw. 14(SI), 2508–2530 (2006). https://doi.org/10.1109/TIT.2006.874516.
 19
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.1 (High-Performance Computing Center Stuttgart, University of Stuttgart, 2015).
 20
K. I. Tsianos, S. F. Lawlor, M. G. Rabbat, in Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS'12, 2. Communication/computation tradeoffs in consensus-based distributed optimization (Curran Associates Inc., Red Hook, NY, USA, 2012), pp. 1943–1951.
 21
R. H. Byrd, S. L. Hansen, J. Nocedal, Y. Singer, A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016). https://doi.org/10.1137/140954362.
 22
I. A. Chen, A. Ozdaglar, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). A fast distributed proximal-gradient method (IEEE, Monticello, 2012), pp. 601–608. https://doi.org/10.1109/Allerton.2012.6483273.
 23
B. Johansson, M. Rabi, M. Johansson, A randomized incremental subgradient method for distributed optimization in networked systems. SIAM J. Optim. 20(3), 1157–1170 (2009). https://doi.org/10.1137/08073038x.
 24
A. Nedić, A. Olshevsky, M. G. Rabbat, Network topology and communication-computation tradeoffs in decentralized optimization. Proc. IEEE. 106(5), 953–976 (2018). https://doi.org/10.1109/JPROC.2018.2817461.
 25
M. Assran, M. Rabbat, Asynchronous subgradient-push. Computing Research Repository, CoRR abs/1803.08950 (2018). arXiv:1803.08950.
 26
M. Assran, M. Rabbat, in 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP). An empirical comparison of multi-agent optimization algorithms (IEEE, Montréal, 2017), pp. 573–577. https://doi.org/10.1109/GlobalSIP.2017.8309024.
 27
J. Zhang, K. You, AsySPA: An exact asynchronous algorithm for convex optimization over digraphs. IEEE Trans. Autom. Control. 65(6), 2494–2509 (2020). https://doi.org/10.1109/tac.2019.2930234.
 28
K. I. Tsianos, S. Lawlor, M. G. Rabbat, in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning, (2012), pp. 1543–1550. https://doi.org/10.1109/Allerton.2012.6483403.
 29
K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent. SIAM J. Optim. 26(3), 1835–1854 (2016). https://doi.org/10.1137/130943170.
 30
D. Jakovetić, J. M. F. Moura, J. Xavier, in 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). Distributed Nesterov-like gradient algorithms, (2012), pp. 5459–5464. https://doi.org/10.1109/CDC.2012.6425938.
 31
S. Sundhar Ram, A. Nedić, V. V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010). https://doi.org/10.1007/s1095701097377.
 32
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK Users’ Guide, 3rd edn. (Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, USA, 1999). https://doi.org/10.1137/1.9780898719604.
 33
L. Blackford, et al., An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 28(2), 135–151 (2002). https://doi.org/10.1145/567806.567807.
 34
E. F. Tjong Kim Sang, F. De Meulder, Language-Independent Named Entity Recognition (II) (2005). https://www.clips.uantwerpen.be/conll2003/ner/. Accessed 30 May 2019.
 35
E. F. Tjong Kim Sang, F. De Meulder, in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, CONLL ’03, vol. 4. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition (Association for Computational Linguistics, USA, 2003), pp. 142–147. https://doi.org/10.3115/1119176.1119195.
 36
I. Guyon, UCI Machine Learning Repository, Gisette Data Set (2008). http://archive.ics.uci.edu/ml/datasets/gisette. Accessed 29 May 2019.
 37
D. Dua, C. Graff, UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences (2017). http://archive.ics.uci.edu/ml. Accessed 29 May 2019.
 38
I. Guyon, S. Gunn, A. Ben-Hur, G. Dror, in Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS ’04, vol. 17. Result analysis of the NIPS 2003 feature selection challenge (MIT Press, Cambridge, MA, USA, 2004), pp. 545–552. https://eprints.soton.ac.uk/261923/.
 39
T. Bertin-Mahieux, UCI Machine Learning Repository, YearPredictionMSD data set (2011). https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD. Accessed 01 Sept 2019.
 40
T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, P. Lamere, in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011). The Million Song Dataset (University of Miami, Miami, 2011), pp. 591–596. https://doi.org/10.7916/D8NZ8J07.
 41
Y. LeCun, C. Cortes, “MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges”, THE MNIST DATABASE of handwritten digits (2005). http://yann.lecun.com/exdb/mnist/. Accessed 01 Sept 2019.
 42
L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Proc. Mag. 29, 141–142 (2012). https://doi.org/10.1109/MSP.2012.2211477.
 43
F. Graf, H. P. Kriegel, M. Schubert, S. Poelsterl, A. Cavallaro, UCI Machine Learning Repository: Relative location of CT slices on axial axis Data Set (2011). https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis. Accessed 08 Sept 2019.
 44
F. Graf, H. P. Kriegel, M. Schubert, S. Pölsterl, A. Cavallaro, in International Conference on Medical Image Computing and Computer-Assisted Intervention, vol. 6892. 2D image registration in CT images using radial image descriptors (Springer, Toronto, 2011), pp. 607–614.
 45
UCI Machine Learning Repository, p53 Mutants Data Set (2010). https://archive.ics.uci.edu/ml/datasets/p53+Mutants. Accessed 03 Sept 2019.
 46
S. Danziger, R. Baronio, L. Ho, L. Hall, K. Salmon, G. Hatfield, P. Kaiser, R. Lathrop, Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning. PLoS Comput. Biol. 5, 1000498 (2009). https://doi.org/10.1371/journal.pcbi.1000498.
 47
S. A. Danziger, J. Zeng, Y. Wang, R. K. Brachmann, R. H. Lathrop, Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants. Bioinformatics. 23(13), 104–114 (2007). https://doi.org/10.1093/bioinformatics/btm166.
 48
S. Danziger, S. J. Swamidass, J. Zeng, L. Dearth, Q. Lu, J. Chen, J. Cheng, V. Hoang, H. Saigo, R. Luo, P. Baldi, R. Brachmann, R. Lathrop, Functional census of mutation sequence spaces: The example of p53 cancer rescue mutants. IEEE/ACM Trans. Comput. Biol. Bioinform. 3, 114–125 (2006). https://doi.org/10.1109/TCBB.2006.22.
 49
E. D. Dolan, J. J. Moré, Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002). https://doi.org/10.1007/s101070100263.
 50
W. S. Zhang, HaidYi/admml12logisticregression: ADMM l1/2 logistic regression using MPI and GSL. GitHub repository. https://github.com/HaidYi/admml12logisticregression. Accessed 15 May 2020.
Acknowledgements
This work is supported by the I-BiDaaS project, funded by the European Commission under Grant Agreement No. 780787. This publication reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein. The authors gratefully acknowledge the AXIOM HPC facility at the Faculty of Sciences, University of Novi Sad, where all the numerical simulations were run.
Funding
Not applicable.
Author information
Authors’ contributions
LF developed the implementation of the algorithm and performed the empirical evaluations. DJ, NK and NKJ contributed with the theoretical advances and design of methods. SS contributed to the experimentation and methods design equally. All authors participated in the main research flow development and in writing and revising the manuscript. All authors read and approved the final manuscript.
Authors’ information
All authors are with the Department of Mathematics and Informatics, Faculty of Sciences, University of Novi Sad, Trg Dositeja Obradovića 4, 21000 Novi Sad, Serbia. Email: lidija.fodor@dmi.uns.ac.rs; dusan.jakovetic@dmi.uns.ac.rs; natasa@dmi.uns.ac.rs; natasa.krklec@dmi.uns.ac.rs; srdjan.skrbic@dmi.uns.ac.rs.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fodor, L., Jakovetić, D., Krejić, N. et al. Performance evaluation and analysis of distributed multiagent optimization algorithms with sparsified directed communication. EURASIP J. Adv. Signal Process. 2021, 25 (2021). https://doi.org/10.1186/s13634021007364
Keywords
 Distributed optimization
 High performance computing
 Performance evaluation