2.1 Optimization and network models
Consider a connected network of n nodes, where each node has access to a convex cost function \(f_{i}: \mathbb{R}^{s} \rightarrow \mathbb{R}\), and assume that fi is known only by node i. The goal is to solve the following unconstrained optimization problem
$$ \min_{x \in \mathbb{R}^{s}} f(x) := \sum_{i=1}^{n}f_{i}(x). $$
(1)
With problem (1) a graph G=(N,E) can be associated, where N={1,...,n} is the set of nodes, and E is the set of edges {i,j}, i.e., pairs of nodes i and j that can directly communicate.
As will be seen, graph G represents a collection of realizable communication links; the algorithms considered here may utilize only subsets of these links over the iterations, in possibly unidirectional, sparsified communications.
The assumption is that G is connected, undirected and simple (no self-loops or multiple links). Denote by Ωi the neighborhood set of node i and associate with graph G an n×n symmetric, doubly stochastic matrix W. The matrix W has positive diagonal entries and respects the sparsity pattern of graph G, i.e., for i≠j, Wij=0 if and only if {i,j}∉E. On the other hand, it is important to note that, in the case of unidirectional communication between the nodes, the graph instantiations over iterations (subgraphs of G) can be directed. Also, assume that Wii>0, for all i.
It can be shown that λ1(W)=1, and \(\bar {\lambda }_{2}(W)<1\), where λ1(W) is the largest eigenvalue of W, and \(\bar {\lambda }_{2}(W)\) is the modulus of the eigenvalue of W that is second largest in modulus. Denote by λn(W) the smallest eigenvalue of W. There also holds |λn(W)|<1.
The following optimization problem can be associated with (1),
$$ \min_{x \in \mathbb{R}^{ns}} \Psi(x):= \sum_{i=1}^{n} f_{i}(x_{i})+\frac{1}{2\alpha}\sum_{i < j}W_{ij}\|x_{i}-x_{j}\|^{2}, $$
(2)
where \(x=\left(x_{1}^{T},..., x_{n}^{T}\right)^{T} \in \mathbb{R}^{ns}\) is the optimization variable, partitioned into s×1 blocks x1,...,xn. The reasoning behind this transformation is the following. Assume s=1 for simplicity. Under the stated assumptions on the matrix W, it can be shown that Wx=x if and only if x1=x2=...=xn, so problem (1) is equivalent to
$$ \min_{x \in \mathbb{R}^{ns}} F(x), \quad \text{s.t.} \quad (I-W)x=0, $$
(3)
where \(F(x):= \sum _{i=1}^{n} f_{i}(x_{i})\) and I is the identity matrix. Moreover, I−W is positive semidefinite, so (I−W)x=0 is equivalent to (I−W)1/2x=0. Therefore, (3) can be replaced by
$$ \min_{x \in \mathbb{R}^{ns}} F(x), \quad \text{s.t.} \quad (I-W)^{1/2}x=0. $$
(4)
In other words, the constraint Wx=x enforces that all the feasible xi’s in optimization problem (3) are mutually equal, thus ensuring the equivalence of (1) and (3) and the equivalence of (1) and (4). Further, a penalty reformulation of (4) can be stated as
$$ \min_{x \in \mathbb{R}^{ns}} F(x)+\frac{1}{2 \alpha} x^{T} (I-W) x, $$
(5)
where \(\frac{1}{\alpha}\) is the penalty parameter. Therefore, (5) represents a quadratic penalty reformulation of the original problem (1). After standard manipulations of the penalty term, exploiting the symmetry and double stochasticity of W, we obtain
$$ \min_{x \in \mathbb{R}^{ns}} F(x) +\frac{1}{2\alpha}\sum_{i < j}W_{ij}\left(x_{i}-x_{j}\right)^{2}, $$
(6)
which is the same as (2) for s=1. These considerations are easily generalized to s>1.
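For completeness, the manipulation in question uses only the symmetry and unit row sums of W; for s=1,

$$ x^{T}(I-W)x=\sum_{i=1}^{n} x_{i}^{2}-\sum_{i,j}W_{ij}x_{i}x_{j}=\frac{1}{2}\sum_{i,j}W_{ij}\left(x_{i}-x_{j}\right)^{2}=\sum_{i<j}W_{ij}\left(x_{i}-x_{j}\right)^{2}, $$

so the penalty term in (5) coincides with the one in (6).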
It is well known [1] that the solutions of (1) and (2) are mutually close. More specifically, for each \(i=1,...,n\), \(||x_{i}^{\circ}-x^{\ast}||=O(\alpha)\), where x∗ is the solution to (1) and \(x^{\bullet}=\left(\left(x_{1}^{\circ}\right)^{T},...,\left(x_{n}^{\circ}\right)^{T}\right)^{T}\) is the solution to (2). In more detail, Theorem 4 in [29] states that, under strongly convex local costs fi with Lipschitz continuous gradients (see ahead Assumption 2.1 for details), the following holds for all i=1,...,n:
$$ \begin{aligned} \|x_{i}^{\circ} - x^{\star} \| & \leq \left(\frac{\alpha L D}{1-\bar{\lambda}_{2}(W)} \right) \sqrt{4/c^{2} - 2 \alpha/c} + \frac{\alpha D}{1-\bar{\lambda}_{2}(W)} \\ & = O\left(\frac{\alpha}{1-\bar{\lambda}_{2}(W)}\right), \end{aligned} $$
(7)
$$ D = \sqrt{ 2 L \left(\sum_{i=1}^{n} f_{i}(0) - \sum_{i=1}^{n} f_{i}\left(x_{i}^{\prime}\right) \right) }; c = \frac{\mu L}{\mu + L}. $$
(8)
Here, \(x_{i}^{\prime }\) is the minimizer of fi, L is the Lipschitz constant of the gradients of the fi’s, and μ is the strong convexity constant of the fi’s.
The usefulness of formulation (2) is that it admits a solution that is close (on the order O(α)) to the desired solution of (1), while, unlike formulation (1), it is readily amenable to distributed implementation. A key insight known in the literature (see, e.g., [4, 30]) is that applying a conventional (centralized) gradient descent method to (2) precisely recovers the distributed gradient method proposed in [1]. In other words, it has been shown that the distributed method in [1] – which approximately solves (1) – actually converges to the solution of (2). This insight has been extensively exploited in the literature to derive several distributed methods, e.g., [4, 5, 16]. The class of methods considered in this paper also exploits this insight and therefore harnesses formulation (2) to carry out the convergence analysis of the considered methods.
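To see the connection for s=1, note that \(\nabla \Psi(x)=\nabla F(x)+\frac{1}{\alpha}(I-W)x\), so gradient descent on (2) with step size α reads

$$ x^{k+1}=x^{k}-\alpha \nabla \Psi\left(x^{k}\right)=x^{k}-\alpha \nabla F\left(x^{k}\right)-(I-W)x^{k}=Wx^{k}-\alpha \nabla F\left(x^{k}\right), $$

i.e., node-wise, \(x_{i}^{k+1}=\sum_{j \in \Omega_{i} \cup \{i\}} W_{ij}x_{j}^{k}-\alpha \nabla f_{i}\left(x_{i}^{k}\right)\), which is the consensus-plus-gradient update of [1]; it also coincides with the exact-communication case Wk=W of (20) below.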
2.2 Algorithmic framework
The algorithmic framework is presented in this section. The framework subsumes several existing methods [12–17], and it also includes a new method that is analysed in this paper.
Within the considered framework, each node i in the network maintains \(x_{i}^{k} \in \mathbb{R}^{s}\), its approximate solution to (1), where k is the iteration counter. In addition, associate with each node i a Bernoulli random variable \(z_{i}^{k}\) that governs its communication activity at iteration k. If \(z_{i}^{k}=1\), node i communicates; if \(z_{i}^{k}=0\), node i does not exchange messages with its neighbors. When \(z_{i}^{k}=1\), node i transmits \(x_{i}^{k}\) to all its neighbours j∈Ωi, and it receives \(x_{j}^{k}\) from all its active (transmitting) neighbours.
The intuition behind the introduction of the quantities \(z_{i}^{k}\) is the following. It has been demonstrated (see, e.g., [12]) that distributed methods for solving (1) and (2) exhibit a certain “redundancy” in terms of the utilized communications. In other words, it is not necessary to activate all communication channels at all times for the algorithm to be convergent. Moreover, communication sparsification may lead to convergence speed improvements in terms of communication cost [12]. Communication sparsification and the introduction of the \(z_{i}^{k}\)’s lead to less expensive but inexact algorithmic updates. A proper design of the \(z_{i}^{k}\)’s can resolve the tradeoff between inexactness and update cost favourably; see, e.g., [12] for details.
Assume that the random variables \(z_{i}^{k}\) are independent both across nodes and across iterations. Denote by \(p_{k} = \mathrm{Prob}\left(z_{i}^{k}=1\right)\) the activation probability, assumed equal across all nodes. The quantity pk is a design parameter of the method; strategies for setting pk are discussed further ahead. Intuitively, a large pk corresponds to “less inexact” updates but also to lower communication savings. Within the considered algorithmic framework, the solution estimate update at node i is as follows:
$$ d_{i}^{k}=-\left(M_{i}^{k}\right)^{-1}\left[\alpha \nabla f_{i}\left(x_{i}^{k}\right)+\sum_{j \in \Omega_{i}}W_{ij}\left(x_{i}^{k}-x_{j}^{k}\right) \xi_{i,j}^{k}\right], $$
(9)
$$ x_{i}^{k+1}=x_{i}^{k}+d_{i}^{k}. $$
(10)
Here, α is a positive parameter, known as the step size. The values of α differ depending on the input data (see ahead Section 2.5). Further, \(\xi_{i,j}^{k}\) is in general a function of \(z_{i}^{k}\) and \(z_{j}^{k}\) that encodes communication sparsification, and \(M_{i}^{k}\) is a local second order information-capturing matrix, i.e., a Hessian approximation.
The following choices of the quantities \(\xi_{i,j}^{k}\) and \(M_{i}^{k}\) will be considered: 1) \(\xi_{i,j}^{k}=1\): no communication sparsification; 2) \(\xi_{i,j}^{k} = z_{i}^{k} \cdot z_{j}^{k}\): bidirectional communication sparsification (that is, node i includes node j’s solution estimate in its update only if both i and j are active in terms of communications); and 3) \(\xi_{i,j}^{k} = z_{j}^{k}\): unidirectional communication, i.e., node i includes node j’s solution estimate in its update whenever node j transmits, irrespective of whether node i is transmission-active or not.
Regarding the matrix \(M_{i}^{k}\), two options are considered. The first is \(M_{i}^{k}=I\), which corresponds to first order methods, with no second order information included. The second option is \(M_{i}^{k}=D_{i}^{k}\), where:
$$ D_{i}^{k}=\alpha \nabla^{2}f_{i}\left(x_{i}^{k}\right)+\left(1-W_{ii}\right)I. $$
(11)
This corresponds to the second order methods of DQN-type [16] (See ahead Section 2.6).
We now provide intuition behind the generic method (9)-(10) and the choices of \(\xi_{i,j}^{k}\) and \(M_{i}^{k}\). The method (9)-(10) corresponds to an inexact first order or an inexact second order method for solving (2) – and hence for approximately solving (1). The main source of inexactness is the sparsification (the \(\xi_{i,j}^{k}\)’s). The bidirectional communication \(\left(\xi_{i,j}^{k} = z_{i}^{k} \cdot z_{j}^{k}\right)\) is appealing as it preserves symmetry in the underlying weight matrix, which is known to be a beneficial theoretical property. On the other hand, bidirectional sparsification is also wasteful in that a node ignores a received message from a neighbor if its own transmission to the same neighbor is not successful (see formula (9)). With respect to the choice between first and second order methods (the choice of \(M_{i}^{k}\)), the second order choice is computationally more expensive per iteration due to the Hessian computations; on the other hand, it can improve convergence speed iteration-wise.
The pseudocode for the general algorithmic framework is in Algorithm 1. A summary of all the considered methods within the framework of Algorithm 1 is given in Table 1.
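As an illustration only (not the paper’s actual implementation, which is described in Section 2.4), the following Python sketch shows one local iteration of (9)-(10) under the unidirectional rule \(\xi_{i,j}^{k}=z_{j}^{k}\); all names are illustrative.

```python
import numpy as np

def local_update(i, x, grad_fi, W, neighbors, z, alpha, hess_fi=None):
    """One iteration of the generic update (9)-(10) at node i (serial sketch).

    x         : list of current estimates x_j^k (node i only reads its neighbors' entries)
    grad_fi   : callable returning the local gradient of f_i at x[i]
    z         : Bernoulli indicators z_j^k; unidirectional rule xi_{i,j}^k = z_j^k
    hess_fi   : callable returning the local Hessian; if None, the first order variant M_i^k = I is used
    """
    s = x[i].shape[0]
    # sparsified consensus term: sum_j W_ij (x_i - x_j) * xi_{i,j}^k
    cons = sum(W[i, j] * (x[i] - x[j]) * z[j] for j in neighbors[i])
    g = alpha * grad_fi(x[i]) + cons

    if hess_fi is None:
        d = -g                                    # M_i^k = I (first order)
    else:
        # M_i^k = D_i^k = alpha * Hess f_i(x_i^k) + (1 - W_ii) I, cf. (11) (DQN-type)
        M = alpha * hess_fi(x[i]) + (1.0 - W[i, i]) * np.eye(s)
        d = -np.linalg.solve(M, g)
    return x[i] + d                               # x_i^{k+1} = x_i^k + d_i^k
```

Passing the local Hessian via hess_fi switches the sketch from the first order variant (\(M_{i}^{k}=I\)) to the DQN-type second order variant (\(M_{i}^{k}=D_{i}^{k}\)); the bidirectional rule would simply use the product \(z_{i}^{k} z_{j}^{k}\) in place of z[j].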
2.3 Convergence analysis
In this section, a convergence analysis of the algorithm variant with unidirectional communications (Method FUI; see ahead Section 2.6) is carried out. More precisely, we assume the following choice of \(M_{i}^{k}\) and \(\xi_{ij}^{k}\):
$$ M_{i}^{k}=I, \quad \xi_{ij}^{k}=z_{j}^{k}. $$
(12)
To the best of our knowledge, except for a different estimation setting [14], this algorithm has not been studied before. The following assumptions are needed.
Assumption 2.1.
(a) Each function \(f_{i}: \mathbb{R}^{s} \rightarrow \mathbb{R}\), i=1,...,n, is twice differentiable, strongly convex with strong convexity modulus μ>0, and it has Lipschitz continuous gradient with constant L, L≥μ.
(b) The graph G is undirected, connected and simple.
(c) The step size α in (2) satisfies \(\alpha < \min \left \{\frac {1}{2L},\frac {1+\lambda _{n}(W)}{L}\right \}\).
By Assumption 2.1, Ψ is strongly convex with modulus μ. Moreover, the gradient of Ψ is Lipschitz continuous with constant
$$ L_{\Psi}:=L+\frac{1-\lambda_{n}(W)}{\alpha}. $$
(13)
Notice that Assumption 2.1 (c) implies that α<(1+λn(W))/L, which is equivalent to
$$ \alpha< \frac{2}{L_{\Psi}}. $$
(14)
Let \(x^{k}=\left (\left (x_{1}^{k}\right)^{T},...,\left (x_{n}^{k}\right)^{T}\right)^{T}\). We have the following convergence result for the first order method with unidirectional communications.
Theorem 2.1.
Let {xk} be a sequence generated by Algorithm 1, method FUI, and let Assumption 2.1 hold. Then, the following results hold:
(a) Assume that the sequence {pk} converges to one as k→∞. Then, the sequence of iterates {xk} converges to x∙ in the expected error norm, i.e., there holds:
$$ {\lim}_{k \rightarrow \infty} E\left[\left\|x^{k}-x^{\bullet}\right\|\right] = 0. $$
(15)
(b) Assume that the sequence {pk} converges to one geometrically as k→∞, i.e., \(p_{k}=1-\delta^{k+1}\) for all k and some δ∈(0,1). Then, there holds:
$$ E\left[\left\|x^{k}-x^{\bullet}\right\|\right] = O\left(\gamma^{k}\right), $$
(16)
where γ<1 is a positive constant.
(c) Assume that pk≥pmin for all k and for some pmin∈(0,1), and that the iterative sequence {xk} is uniformly bounded, i.e., there exists a constant C1>0 such that E[∥xk∥]≤C1 for all k. Then, there holds:
$$ E\left[\left\|x^{k}-x^{\bullet}\right\|\right] \leq \theta^{k} \|x^{0} - x^{\bullet}\| + \left(1-p_{min}\right)^{2} \,C_{2}, $$
(17)
where \(C_{2}=\frac {2nC_{1}} {\alpha \mu }\) and θ∈(0,1).
Theorem 2.1 demonstrates that Algorithm 1 with sparsified and unidirectional communications converges. More precisely, as long as the sequence pk converges to one, even arbitrarily slowly, the sequence {xk} converges to the solution of (2) in the expected error norm sense. When the convergence of pk to one is geometric, xk converges geometrically, i.e., at a linear rate. Finally, when pk stays bounded away from one, under the additional assumption that the sequence {xk} is uniformly bounded, the algorithm converges to a neighbourhood of the solution to (2), where the neighbourhood size is controlled by the parameter pmin (the closer pmin is to one, the smaller the error). This complements the existing results in [16], which concern bidirectional communications.
Next, the proof of Theorem 2.1 will be carried out. To avoid notation clutter, let the dimension of the original problem (1) be s=1. The proof relies on the fact that the method can be written as an inexact gradient method for minimization of Ψ. More specifically, it can be shown that the algorithm determined by (9) – (12) is equivalent to the following:
$$ x^{k+1}=x^{k}-\alpha\left[\nabla\Psi\left(x^{k}\right)+e^{k}\right], $$
(18)
where \(e^{k}=\left (e^{k}_{1},...,e_{n}^{k}\right)^{T}\) is given by
$$ e_{i}^{k}=\frac{1}{\alpha}\sum_{j \in \Omega_{i}}W_{ij}\left(z_{j}^{k}-1\right)\left(x_{i}^{k}-x_{j}^{k}\right) $$
(19)
and \(e^{k} \in \mathbb{R}^{n}\). Indeed, in view of (12), method (9)-(10) can be represented as
$$ x^{k+1}=x^{k}-\alpha \nabla F\left(x^{k}\right)-\left(I-W_{k}\right)x^{k}, $$
(20)
where
$$ F: \mathbb{R}^{n} \rightarrow \mathbb{R}, \quad F(x)=\sum_{i=1}^{n} f_{i}\left(x_{i}\right), $$
(21)
$$ [W_{k}]_{ij}=\left\{\begin{array}{ll} W_{ij}z_{j}^{k}, & \text{if } \{i,j\} \in E,\ i \neq j, \\ 0, & \text{if } \{i,j\} \notin E,\ i \neq j, \\ 1-\sum_{l \neq i}[W_{k}]_{il}, & \text{if } i=j. \end{array}\right. $$
(22)
Thus,
$$ \begin{aligned} x^{k+1} & = x^{k}-\alpha \left(\nabla F\left(x^{k}\right)+\frac{1}{\alpha} \left(I-W_{k}\right)x^{k} \pm \frac{1}{\alpha} (I-W)x^{k} \right) \\ & =x^{k}-\alpha \left(\nabla \Psi \left(x^{k}\right)+\frac{1}{\alpha}\left(\left(I-W_{k}\right)x^{k}-(I-W)x^{k}\right)\right). \end{aligned} $$
(23)
Therefore, for each component i, the error is determined by
$$ e_{i}^{k}=\frac{1}{\alpha}\left(\sum_{j \in \Omega_{i}}W_{ij}z_{j}^{k} \left(x_{i}^{k}-x_{j}^{k}\right)-\sum_{j \in \Omega_{i}}W_{ij} \left(x_{i}^{k}-x_{j}^{k}\right)\right), $$
(24)
and (19) follows.
Next we state and prove an important result. Here and throughout the paper, ||·|| denotes the vector 2-norm and the corresponding matrix norm.
Lemma 2.2.
Suppose that Assumption 2.1 holds. Then for each k we have
$$ ||x^{k}-x^{\bullet}|| \leq \theta^{k}||x^{0}-x^{\bullet}||+\alpha \sum_{t=1}^{k}\theta^{k-t}||e^{t-1}||, $$
(25)
where x0 is the initial iterate and θ= max{1−αμ,αLΨ−1}<1.
Proof. Using (18) and the fact that ∇Ψ(x∙)=0 we obtain
$$ x^{k+1}-x^{\bullet}=x^{k} - x^{\bullet}-\alpha e^{k} -\alpha \left(\nabla \Psi \left(x^{k}\right)-\nabla \Psi \left(x^{\bullet}\right)\right). $$
(26)
Further, there exists a symmetric positive definite matrix Bk such that
$$ \nabla \Psi \left(x^{k}\right)-\nabla \Psi \left(x^{\bullet}\right)=B_{k} \left(x^{k}-x^{\bullet}\right) $$
(27)
and its spectrum belongs to [μ,LΨ]. Thus, we obtain
$$ \|I-\alpha B_{k}\|\leq \max \left\{1-\alpha \mu, \alpha L_{\Psi}-1\right\}:=\theta. $$
(28)
Notice that Assumption 2.1 (c) implies that θ<1, since (14) holds and L≥μ. Moreover, putting together (26) - (28), we obtain
$$ \|x^{k+1}-x^{\bullet}\|\leq \theta \|x^{k} - x^{\bullet}\| +\alpha \|e^{k}\| $$
(29)
and applying the induction argument we obtain the desired result. \(\square \)
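Explicitly, the induction in the proof of Lemma 2.2 amounts to unrolling (29):

$$ \|x^{k}-x^{\bullet}\| \leq \theta\|x^{k-1}-x^{\bullet}\|+\alpha\|e^{k-1}\| \leq \theta^{2}\|x^{k-2}-x^{\bullet}\|+\alpha\theta\|e^{k-2}\|+\alpha\|e^{k-1}\| \leq \ldots \leq \theta^{k}\|x^{0}-x^{\bullet}\|+\alpha\sum_{t=1}^{k}\theta^{k-t}\|e^{t-1}\|, $$

which is exactly (25).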
To complete the proof of parts (a) and (b) of Theorem 2.1, we need to derive an upper bound on ||ek|| in the expected-norm sense. To do so, we first establish the boundedness of the iterates xk in the expected norm sense.
Lemma 2.3.
Let Assumption 2.1 hold, and consider the setting of Theorem 2.1 (a). Then, there holds E[||xk||]≤Cx for all k, where Cx is a positive constant.
Proof. The update rule (20) can be written equivalently as follows
$$ x^{k+1}=W_{k} x^{k}-\alpha \nabla F\left(x^{k}\right). $$
(30)
Introduce \(\widetilde {W_{k}}=W_{k}-W\), and rewrite (30) as
$$ x^{k+1}=W x^{k}-\alpha \nabla F\left(x^{k}\right)+\widetilde{W_{k}} x^{k}. $$
(31)
Denote by x′ the minimizer of F. Then, by the Mean Value Theorem, there holds
$$ \begin{aligned} \nabla F\left(x^{k}\right)-\nabla F\left(x^{\prime}\right) & =\underbrace{\left[ \int_{0}^{1} \nabla^{2}F\left(x^{\prime}+t\left(x^{k}-x^{\prime}\right)\right) dt \right]}_{H_{k}} \left(x^{k}-x^{\prime}\right) \\ & =H_{k}\left(x^{k}-x^{\prime}\right)=H_{k} x^{k}-H_{k} x^{\prime}, \end{aligned} $$
(32)
and
$$ x^{k+1}=\left(W-\alpha H_{k}\right) x^{k}+\widetilde{W_{k}} x^{k}+\alpha H_{k} x^{\prime}- \alpha \nabla F(x^{\prime}). $$
(33)
Note that ||Hk||≤L, by Assumption 2.1. Also, note that ||W−αHk||≤1−αμ, for \(\alpha \leq \frac {1}{2L}\). Therefore, the following can be stated
$$ \begin{aligned} \|x^{k+1}\| & \leq (1-\alpha \mu) \|x^{k}\|+\underbrace{\alpha\left(L\|x^{\prime}\|+\|\nabla F\left(x^{\prime}\right)\|\right)}_{C^{\prime}} +\|\widetilde{W_{k}}\| \cdot \|x^{k}\| \\ & =(1-\alpha \mu)\|x^{k}\|+C^{\prime}+\|\widetilde{W_{k}}\| \cdot \|x^{k}\|. \end{aligned} $$
(34)
Next, \(||\widetilde {W_{k}}||\) will be upper bounded. Note that
$$ ||\widetilde{W_{k}}|| \leq \sqrt{n}||\widetilde{W_{k}}||_{1} \leq \sqrt{n} \sum_{i=1}^{n} \sum_{j=1}^{n}|\left[\widetilde{W_{k}}\right]_{ij}|. $$
(35)
Therefore,
$$ ||\widetilde{W_{k}}|| \leq 2\sqrt{n} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij}\left(1-z_{j}^{k}\right). $$
(36)
Taking expectation and using the fact that \(E\left [z_{j}^{k}\right ]=p_{k}\), for all k, it can be concluded that
$$ E\left[||\widetilde{W_{k}}||\right] \leq \widetilde{C} \left(1-p_{k}\right) $$
(37)
for some positive constant \(\widetilde {C}\). Now, using independence of \(\widetilde {W_{k}}\) and xk, the following can be obtained from (34),
$$ \begin{aligned} E\left[\|x^{k+1}\|\right] & \leq (1-\alpha \mu) E\left[\|x^{k}\|\right]+C^{\prime}+\left(1-p_{k}\right) \widetilde{C}\, E \left[\|x^{k}\|\right] \\ & = \left(1-\alpha \mu +\widetilde{C}\left(1-p_{k}\right) \right)E\left[\|x^{k}\|\right]+C^{\prime}. \end{aligned} $$
(38)
As pk→1, i.e., (1−pk)→0, it is clear that, for sufficiently large k, there holds
$$ E\left[||x^{k+1}||\right] \leq \left(1-\frac{1}{2} \alpha \mu \right) E\left[||x^{k}||\right] + C^{\prime}. $$
(39)
This implies that there exists a constant Cx such that E[||xk||]≤Cx, for all k=0,1,.... \(\square \)
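Concretely, if k0 denotes an iteration index beyond which (39) holds, iterating the recursion gives, for k≥k0,

$$ E\left[\|x^{k}\|\right] \leq \left(1-\tfrac{1}{2}\alpha\mu\right)^{k-k_{0}} E\left[\|x^{k_{0}}\|\right]+\frac{2C^{\prime}}{\alpha\mu}, $$

so one may take, for instance, \(C_{x}=\max\left\{\max_{0 \leq k \leq k_{0}} E\left[\|x^{k}\|\right],\, E\left[\|x^{k_{0}}\|\right]+2C^{\prime}/(\alpha\mu)\right\}\).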
Applying Lemma 2.3, the following result is obtained.
Lemma 2.4.
Suppose that Assumption 2.1 holds and that E[∥xk∥]≤C1 for all k and some constant C1. Then the error sequence {∥ek∥} satisfies
$$ E\left[||e^{k}||\right] \leq \left(1-p_{k}\right) C_{e}, $$
(40)
for the constant \(C_{e}=\frac {2n}{\alpha }\left (1-p_{min}\right) C_{1}\).
Proof. The proof follows straightforwardly from (19) and Lemma 2.3. Consider (24). Then, \(|e_{i}^{k}|\) can be upper bounded as follows:
$$ |e_{i}^{k}| \leq \frac{1} {\alpha} \sum_{j \in \Omega_{i}} W_{ij}\, |1-z_{j}^{k}|\, 2 \left\|x^{k}\right\|. $$
(41)
This yields:
$$ \left\|e^{k}\right\| \leq \left\|e^{k}\right\|_{1} \leq \sum_{i=1}^{n} \frac{2} {\alpha} \sum_{j \in \Omega_{i}} W_{ij}\, |1-z_{j}^{k}| \left\|x^{k}\right\|. $$
(42)
Taking expectation while using the independence of \(z_{j}^{k}\) and xk, and using E[∥xk∥]≤C1, \(\sum_{j \in \Omega_{i}} W_{ij} \leq 1\), and \(E\left[|1-z_{j}^{k}|\right] = 1-p_{k}\), the result follows. \(\square\)
Now, Theorem 2.1 can be proved as follows.
Proof of Theorem 2.1. We first prove part (a). Taking expectation in Lemma 2.2, and using Lemma 2.4, we get
$$ \begin{aligned} E\left[||x^{k}-x^{\bullet}||\right] & \leq \theta^{k} ||x^{0} - x^{\bullet}|| +\alpha \sum_{t=1}^{k}\theta^{k-t} E\left[||e^{t-1}||\right] \\ & \leq \theta^{k}||x^{0}-x^{\bullet}|| +\alpha \sum_{t=1}^{k} \theta^{k-t} \cdot C_{e} \left(1-p_{t-1}\right). \end{aligned} $$
(43)
Next, applying Lemma 3.1 in [31], it follows that
$$ E\left[\left\|x^{k} - x^{\bullet}\right\|\right] \rightarrow 0, $$
(44)
as we wanted to prove.
Let us now consider part (b). Note that, in this case, we have \(1-p_{k}=\delta^{k+1}\) for all k. Specializing the bound in (43) to this choice of pk, the following holds
$$ \begin{aligned} E\left[||x^{k}-x^{\bullet}||\right] & \leq \theta^{k}||x^{0}-x^{\bullet}|| +\alpha C_{e} \sum_{t=1}^{k} \theta^{k-t} \delta^{t}, \end{aligned} $$
(45)
and using the fact that \(s_{k}:=\sum _{t=1}^{k} \theta ^{k-t} \delta ^{t}\) converges to zero R-linearly (see Lemma II.1 from [16]), we obtain the result.
Finally, we prove part (c). Here, we upper bound the term (1−pt−1) in (43) with (1−pmin). For this case we obtain
$$ \begin{aligned} E\left[||x^{k}-x^{\bullet}||\right] & \leq \theta^{k}||x^{0}-x^{\bullet}|| \\ & + \left(1-p_{min}\right) C_{e} \frac{1}{ \mu}, \end{aligned} $$
(46)
which completes the proof of part (c).\(\square \)
2.4 Implementation and infrastructure
A parallel implementation of Algorithm 1 was developed, using MPI [19]. The testing was performed on the AXIOM computing facility consisting of 16 nodes (8 x Intel i7 5820k 3.3GHz and 8 x Intel i7 8700 3.2GHz CPU - 96 cores and 16GB DDR4 RAM/node) interconnected by a 10 Gbps network.
Grid and d-regular graph configurations are considered for graph G. For each data set, tests are conducted with the same number of nodes for both graph types, d-regular graphs and grids.
The input data for the algorithm are read from binary files by the master process. The master process then scatters the data to the other processes in equal pieces. If the data size is not divisible by the number of processes, the remaining data are assigned to the master process. Therefore, the data reside in memory during the computation, and no Input/Output (I/O) operations are performed while executing the algorithm.
The communication between the nodes is realized by creating a set of communicators – one for each node. The i-th communicator contains the i-th node as the master, together with its neighbors. When sparsifying the communication between the nodes, the communicators have to be recreated across the iterations, in order to ensure that only active nodes can send their results, see [11]. When using bidirectional communication, an active node is included in its own communicator and in the communicators of its active neighbours. An inactive node is not included in the communicators of its neighbors, and it does not need its own communicator at the current iteration. In the case of unidirectional communication, an inactive node is included in its own communicator, but not in the communicators of its neighbors.
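To make this communicator management concrete, the following is a minimal mpi4py sketch of the unidirectional case (a hedged illustration: the actual implementation is MPI-based with LAPACK/BLAS kernels, and the sketch assumes every rank knows the current activation pattern; the names neighbors and active are illustrative).

```python
from mpi4py import MPI

def build_communicators(comm, neighbors, active):
    """Create one communicator per node: node i plus its currently active neighbors.

    comm      : MPI.COMM_WORLD (or an equivalent parent communicator)
    neighbors : dict rank -> list of neighbor ranks, encoding the graph G
    active    : dict rank -> bool, the activation indicators z_i^k for this iteration
    """
    world_group = comm.Get_group()
    comms = {}
    for i in range(comm.Get_size()):
        # unidirectional rule: node i always belongs to its own communicator,
        # while a neighbor j is included only if it transmits (active[j] is True)
        members = sorted({i} | {j for j in neighbors[i] if active[j]})
        group_i = world_group.Incl(members)
        # Create() is collective over comm; ranks not in group_i obtain MPI.COMM_NULL
        comms[i] = comm.Create(group_i)
    return comms
```

In this sketch the communicators are rebuilt at every iteration in which the activation pattern changes, mirroring the recreation of communicators described above.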
The data distribution process does not consume a large amount of the execution time. For example, for a data set containing a 5000×6000 matrix and a vector of 5000 elements, the initial setup, including reading and scattering the data as well as creating the communicators, takes about 0.3 s per process. Compared to the overall run-time of the tests, this is a relatively small percentage: about 5% for the test case with the lowest execution time, and only 0.0007% for the case with the highest execution time.
Regarding the stopping criterion, we let the algorithms run until ||∇Ψ(xk)||≤ε, where ε=0.01. Note that the gradient ∇Ψ(xk) is in general not computable by any single node in a distributed graph G. In our implementation, ∇Ψ(xk) is maintained by the master node. While this is not a realistic stopping criterion in a fully distributed setting, it allows us to adequately compare the different algorithmic strategies.
The implementation relies on efficient LAPACK [32] and BLAS [33] linear algebra operations, applied at the nodes for the local calculations.
2.5 Simulation setup
The tests were performed on two types of graphs: d-regular and grid graphs with different numbers of nodes. The d-regular graphs were constructed in the following way. For 8-regular graphs, for each number of nodes n, we start from a ring graph with nodes {1,2,...,n} and then add, to each node i, links to the nodes i−4, i−3, i−2 and i+2, i+3, i+4, where the additions and subtractions are modulo n. The same principle was also used for the 4-regular and 16-regular graphs used in this paper; a sketch of the construction is given below.
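A minimal sketch of this construction (illustrative code, not the actual test harness):

```python
def d_regular_ring(n, d):
    """Ring-based d-regular graph (d even): node i is linked to i±1, ..., i±(d/2), modulo n.

    For d = 8 this reproduces the 8-regular construction described above.
    """
    edges = set()
    for i in range(n):
        for offset in range(1, d // 2 + 1):
            j = (i + offset) % n
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

# Example: an 8-regular graph on 16 nodes
# print(d_regular_ring(16, 8))
```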
The tests are performed for the logistic loss functions given by
$$ f_{i}(x)=\sum_{j=1}^{J}\mathcal{J}_{logis}\left(b_{ij}\left(x_{1}^{\top} a_{ij}+x_{0}\right)\right)+\frac{\tau}{n}||x||^{2}. $$
(47)
Here, \( x = \left(x_{1}^{T}, x_{0}\right) \in \mathbb{R}^{s-1} \times \mathbb{R} \) represents the optimization variable and τ is the regularization parameter. The input data at node i are the feature vectors \(a_{ij} \in \mathbb{R}^{s-1}\) and the labels \(b_{ij} \in \mathbb{R}\).
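For concreteness, a sketch of evaluating the local cost (47) and its gradient is given below; it assumes that \(\mathcal{J}_{logis}\) is the standard logistic loss, \(\mathcal{J}_{logis}(t)=\log\left(1+e^{-t}\right)\), and that the labels take values in {−1,+1} (both are assumptions made here for illustration, not stated above); all names are illustrative.

```python
import numpy as np

def logistic_local_loss_grad(x, A_i, b_i, tau, n):
    """Local cost (47) and its gradient at node i (sketch).

    A_i : (J, s-1) local feature matrix; b_i : (J,) local labels, assumed in {-1, +1}.
    x   : (s,) variable with the intercept x0 stored in the last entry.
    """
    x1, x0 = x[:-1], x[-1]
    t = b_i * (A_i @ x1 + x0)                       # margins b_ij (x1^T a_ij + x0)
    loss = np.sum(np.logaddexp(0.0, -t)) + (tau / n) * np.dot(x, x)
    sigma = 1.0 / (1.0 + np.exp(t))                 # equals -J_logis'(t)
    g1 = -(A_i.T @ (b_i * sigma)) + 2.0 * (tau / n) * x1
    g0 = -np.sum(b_i * sigma) + 2.0 * (tau / n) * x0
    return loss, np.concatenate([g1, [g0]])
```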
The testing is performed on different versions of Algorithm 1 with sparsified communication, for both bidirectional and unidirectional communication strategies (see ahead Table 1).
The input data are represented as an r×(s−1) sized matrix of features, and an r sized vector of labels. Both the matrix and the vector are then divided into n parts corresponding to the nodes as explained in the previous section. We then vary n (and the corresponding graph G) and investigate the performance of Algorithm 1.
The following data sets were used for testing.
-
The Conll data set [34, 35], that concerns language-independent named entity recognition. It has r=220663 and s=20 as the input data sizes. This data set is only used for comparing the performance of the algorithm between regular and grid graphs.
-
The Gisette data set [36–38], known as a handwritten digit recognition problem. Its input data sizes are r=6000 and s=5001. The data set is used for testing the different alternatives of the algorithm as well as for determining the most suitable value of d for d-regular graphs.
-
The YearPredictionMSD train data set is used to predict the release year of a song from audio features [37, 39, 40]. Here, r=463715 and s=91. The data set is also used for testing the different alternatives of the algorithm.
-
The MNIST data set represents a database of handwritten digits [41, 42], with input data sizes r=60000 and s=785. This data set is also used for testing the different alternatives of the algorithm.
-
The Relative location of CT slices on axial axis data set (referred to as CT data set further on), containing features extracted from CT images [37, 43, 44]. The data sizes are r=53500 and s=386. This data set is also used for testing the different alternatives of the algorithm.
-
The p53 Mutants data set [37, 45–48] (referred to as p53 data set further on) is used for modelling mutant p53 transcriptional activity (active or inactive) based on data extracted from biophysical simulations. The data set sizes are r=31159 and s=5410. The data set is also used for testing the different alternatives of the algorithm.
The parameters for Algorithm 1 are set according to experimentally obtained conclusions. The value of α can be defined as \(\alpha=\frac{1}{KL}\), where L is the Lipschitz gradient constant and K∈[10,100], as proposed in [5]. The value of α can then be fine-tuned for the data set used in the tests. Increasing this value can lead to faster convergence; however, if the value is too large, the algorithm might converge only to a coarse solution neighbourhood. The values of α used for the six data sets mentioned above were obtained experimentally and are listed below:
-
α=0.0001 for the Gisette data set;
-
α=0.001 for the p53 data set;
-
α=0.1 for the YearPredictionMSD, MNIST, Conll and CT data sets.
A larger value of α=0.1 can be applied in cases with a relatively small number of features compared to the number of instances (i.e., rows of data). Here, in all four cases with α=0.1, the number of features is smaller than 1000.
The probability of communication pk is set either as \(p_{k}=1-0.5^{k}\), where k is the iteration counter, or as \(p_{k}=(k+1)^{-1}\). In other words, we consider an increasing and a decreasing sequence of pk’s. The decreasing sequence is of interest for analysis, as it gradually reduces the communication over the iterations; this might require more iterations, as the communication links become sparser. The increasing sequence may, on the other hand, require fewer iterations, but those iterations become increasingly time consuming as the number of active communication links grows. It is of interest to investigate both possibilities.
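A small illustrative sketch of the two schedules and of drawing the activation indicators \(z_{i}^{k}\):

```python
import numpy as np

def p_increasing(k):
    return 1.0 - 0.5 ** k          # increasing schedule, p_k = 1 - 0.5^k

def p_decreasing(k):
    return 1.0 / (k + 1)           # decreasing schedule, p_k = (k + 1)^{-1}

def sample_activations(n, k, schedule, rng=np.random.default_rng(0)):
    """Draw the Bernoulli indicators z_1^k, ..., z_n^k for iteration k."""
    return rng.random(n) < schedule(k)
```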
The local second order information-capturing matrix \(M_{i}^{k}\) can be included in the computation as \(M_{i}^{k}=D_{i}^{k}\), with \(D_{i}^{k}\) defined as in (11), or it can be replaced by the identity matrix, \(M_{i}^{k}=I\). Both possibilities are of interest for testing, as it is of interest to establish empirically whether the additional computation required to solve the system of linear equations in (9) pays off. With \(M_{i}^{k} = I\) we perform a (probably larger) number of cheaper iterations.
2.6 Description of the methods
Table 1 lists the different methods as alternatives of Algorithm 1, considering the solution update, defined in (9), (10) and (11). The naming convention for the methods was already described in the introductory section (see page 2).
Method SBC represents the initial version of the algorithm, used as the benchmark here, while Method FBC is its first order counterpart. These methods do not utilize any communication sparsification.
Note that Methods FBI, FBD, FUI, FUD, SBI, SBD, SUI, SUD use sparsification with either increasing or decreasing communication probabilities pk. The rationale for choosing a linearly increasing pk and a sub-linearly decreasing pk follows insights available in the literature; see, e.g., [16], [14]. While it is possible to consider other choices and fine-tuning of the sequence pk, this topic is outside the paper’s scope. Our primary aim is to investigate the feasibility and performance of increasing and decreasing sequences of pk’s relative to the always-communicating strategy (Methods SBC and FBC), as well as unidirectional versus bidirectional communication, and first order versus second order methods.
The convergence analysis for the novel method with unidirectional communication, Method FUI, is presented here, while Methods SUI and SUD, which also rely on unidirectional communication, remain open for theoretical analysis. Methods FBI, FBD, SBI and SBD, which use bidirectional communication, have already been analysed in the literature (see [12–17]).
The listed methods and the data sets described above are used to derive empirical conclusions. As expected, the analysis of the obtained results provides some insights about the optimal number of nodes for different setups. Also, the advantages of particular methods are clearly visible, and the usefulness of sparsification can be estimated from these results, keeping in mind that the tests might be influenced by the selection of data sets. Nevertheless, we believe that the obtained insights are useful.