FROST: Fast row-stochastic optimization with uncoordinated stepsizes
EURASIP Journal on Advances in Signal Processing, volume 2019, Article number: 1 (2019)
Abstract
In this paper, we discuss distributed optimization over directed graphs, where doubly stochastic weights cannot be constructed. Most of the existing algorithms overcome this issue by applying push-sum consensus, which utilizes column-stochastic weights. The formulation of column-stochastic weights requires each agent to know (at least) its out-degree, which may be impractical in, for example, broadcast-based communication protocols. In contrast, we describe FROST (Fast Row-stochastic-Optimization with uncoordinated STepsizes), an optimization algorithm applicable to directed graphs that does not require the knowledge of out-degrees, the implementation of which is straightforward as each agent locally assigns weights to the incoming information and locally chooses a suitable stepsize. We show that FROST converges linearly to the optimal solution for smooth and strongly convex functions given that the largest stepsize is positive and sufficiently small.
Introduction
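In this paper, we study distributed optimization, where n agents are tasked to solve the following problem:
$$\min_{\mathbf{x}\in\mathbb{R}^{p}} F(\mathbf{x})\triangleq\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}), $$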
where each objective, \(f_{i}:\mathbb {R}^{p}\rightarrow \mathbb {R}\), is private and known only to agent i. The goal of the agents is to find the global minimizer of the aggregate cost, F(x), via local communication with their neighbors and without revealing their private objective functions. This formulation has recently received great attention due to its extensive applications in, for example, machine learning [1–6], control [7], cognitive networks [8, 9], and source localization [10, 11].
Early work on this topic includes Distributed Gradient Descent (DGD) [12, 13], which is computationally simple but is slow due to a diminishing stepsize. The convergence rates are \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\) for general convex functions and \(\mathcal {O}\left (\frac {\log k}{k}\right)\) for strongly convex functions, where k is the number of iterations. With a constant stepsize, DGD converges faster albeit to an inexact solution [14, 15]. Related work also includes methods based on the Lagrangian dual [16–19] to achieve faster convergence, at the expense of more computation. To achieve both fast convergence and computational simplicity, some fast distributed first-order methods have been proposed. A Nesterov-type approach [20] achieves \(\mathcal {O}\left (\frac {\log k}{k^{2}}\right)\) for smooth convex functions with bounded gradient assumption. EXact firsT-ordeR Algorithm (EXTRA) [21] exploits the difference of two consecutive DGD iterates to achieve a linear convergence to the optimal solution. Exact diffusion [22, 23] applies an adapt-then-combine structure [24] to EXTRA and generalizes the symmetric doubly stochastic weights required in EXTRA to locally balanced row-stochastic weights over undirected graphs. Of significant relevance to this paper is a distributed gradient tracking technique built on dynamic consensus [25], which enables each agent to asymptotically learn the gradient of the global objective function. This technique was first proposed simultaneously in [26, 27]. Xu et al. and Qu and Li [26, 28] combine it with the DGD structure to achieve improved convergence for smooth and convex problems. Lorenzo and Scutari [27, 29], on the other hand, propose the NEXT framework for a more general class of nonconvex problems.
All of the aforementioned methods assume that the multi-agent network is undirected. In practice, it may not be possible to achieve undirected communication. It is of interest, thus, to develop optimization algorithms that are fast and are applicable to arbitrary directed graphs. The challenge here lies in the fact that doubly stochastic weights, standard in many distributed optimization algorithms, cannot be constructed over arbitrary directed graphs. In particular, the weight matrices in directed graphs can only be either row-stochastic or column-stochastic, but not both.
We now discuss related work on directed graphs. Early work based on DGD includes subgradient-push [30, 31] and Directed-Distributed Gradient Descent (D-DGD) [32, 33], with a sublinear convergence rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\). Some recent work extends these methods to asynchronous networks [34–36]. To accelerate the convergence, DEXTRA [37] combines push-sum [38] and EXTRA [21] to achieve linear convergence given that the stepsize lies in some nontrivial interval. This restriction on the stepsize is later relaxed in ADD-OPT/Push-DIGing [39, 40], which linearly converge for a sufficiently small stepsize. Of relevance is also [41], where distributed nonconvex problems are considered with column-stochastic weights. More recent work [42, 43] proposes the \(\mathcal {AB}\) and \(\mathcal {AB}m\) algorithms, which employ both row- and column-stochastic weights to achieve (accelerated) linear convergence over arbitrary strongly connected graphs. Note that although the construction of doubly stochastic weights is avoided, all of the aforementioned methods require each agent to know its out-degree to formulate doubly or column-stochastic weights. This requirement may be impractical in situations where the agents use a broadcast-based communication protocol. In contrast, Refs. [44, 45] provide algorithms that only use row-stochastic weights. Row-stochastic weight design is simple and is further applicable to broadcast-based methods.
In this paper, we focus on optimization with row-stochastic weights following the recent work in [44, 45]. We propose a fast optimization algorithm, termed FROST (Fast Row-stochastic Optimization with uncoordinated STepsizes), which is applicable to both directed and undirected graphs with uncoordinated stepsizes among the agents. Distributed optimization (based on gradient tracking) with uncoordinated stepsizes has been previously studied in [26, 46, 47], over undirected graphs with doubly stochastic weights, and in [48], over directed graphs with column-stochastic weights. These works introduce a notion of heterogeneity among the stepsizes, defined respectively as the relative deviation of the stepsizes from their average in [26, 46] and as the ratio of the largest to the smallest stepsize in [47, 48]. It is then shown that when the heterogeneity is small enough, i.e., the stepsizes are very close to each other, and when the largest stepsize follows a bound as a function of the heterogeneity, the proposed algorithms linearly converge to the optimal solution. A challenge in this formulation is that choosing a sufficiently small, local stepsize does not ensure small heterogeneity, while no stepsize can be chosen to be zero. In contrast, a major contribution of this paper is that we establish linear convergence with uncoordinated stepsizes when the upper bound on the stepsizes is independent of any notion of heterogeneity. The implementation of FROST therefore is completely local, since each agent locally chooses a sufficiently small stepsize, independent of other stepsizes, and locally assigns row-stochastic weights to the incoming information. In addition, our analysis shows that all stepsizes except one can be zero for the algorithm to work, which is a novel result in distributed optimization. We show that FROST converges linearly to the optimal solution for smooth and strongly convex functions.
Notation: We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. The matrix, I_{n}, represents the n×n identity, whereas 1_{n} (0_{n}) is the n-dimensional column vector of all 1’s (0’s). We further use e_{i} to denote an n-dimensional vector of all 0’s except 1 at the ith location. For an arbitrary vector, x, we denote its ith element by [x]_{i} and diag{x} is a diagonal matrix with x on its main diagonal. We denote by X⊗Y the Kronecker product of two matrices, X and Y. For a primitive, row-stochastic matrix, \(\underline {A}\), we denote its left and right Perron eigenvectors by π_{r} and 1_{n}, respectively, such that \(\boldsymbol {\pi }_{r}^{\top }\mathbf {1}_{n} = 1\); similarly, for a primitive, column-stochastic matrix, \(\underline {B}\), we denote its left and right Perron eigenvectors by 1_{n} and π_{c}, respectively, such that \(\mathbf {1}_{n}^{\top }\boldsymbol {\pi }_{c} = 1\) [49]. For a matrix, X, we denote ρ(X) as its spectral radius and diag(X) as a diagonal matrix consisting of the corresponding diagonal elements of X. The notation ∥·∥_{2} denotes the Euclidean norm of vectors and matrices, while ∥·∥_{F} denotes the Frobenius norm of matrices. Depending on the argument, we denote ∥·∥ either as a particular matrix norm, the choice of which will be clear in Lemma 1, or a vector norm that is compatible with this matrix norm, i.e., ∥Xx∥≤∥X∥∥x∥ for all matrices, X, and all vectors, x [49].
We now describe the rest of the paper. Section 2 states the problem and assumptions. Section 3 reviews related algorithms that use doubly stochastic or columnstochastic weights and shows the intuition behind the analysis of these types of algorithms. In Section 4, we provide the main algorithm, FROST, proposed in this paper. In Section 5, we develop the convergence properties of FROST. Simulation results are provided in Section 6, and Section 7 concludes the paper.
Problem formulation
Consider n agents communicating over a strongly connected network, \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), where \(\mathcal {V}=\{1,\cdots,n\}\) is the set of agents and \(\mathcal {E}\) is the set of edges, \((i,j), i,j\in \mathcal {V}\), such that agent j can send information to agent i, i.e., j→i. Define \(\mathcal {N}_{i}^{{\text {in}}}\) as the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, \(\mathcal {N}_{i}^{{ \text {out}}}\) is the set of out-neighbors of agent i. Note that both \(\mathcal {N}_{i}^{{ \text {in}}}\) and \(\mathcal {N}_{i}^{{ \text {out}}}\) include agent i. The agents are tasked to solve the following problem:
$$\mathbf{P1}: \quad \min_{\mathbf{x}\in\mathbb{R}^{p}} F(\mathbf{x})\triangleq\frac{1}{n}\sum_{i=1}^{n}f_{i}(\mathbf{x}), $$
where \(f_{i}:\mathbb {R}^{p}\rightarrow \mathbb {R}\) is a private cost function only known to agent i. We denote the optimal solution of P1 as x^{∗}. We will discuss different distributed algorithms related to this problem under the applicable set of assumptions, described below:
Assumption 1
The graph, \(\mathcal {G}\), is undirected and connected.
Assumption 2
The graph, \(\mathcal {G}\), is directed and strongly connected.
Assumption 3
Each local objective, f_{i}, is convex with bounded subgradient.
Assumption 4
Each local objective, f_{i}, is smooth and strongly convex, i.e., ∀i and \(\forall \mathbf {x}, \mathbf {y}\in \mathbb {R}^{p}\),

i. There exists a positive constant l such that
$$\left\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{y})\right\|_{2}\leq l\left\|\mathbf{x}-\mathbf{y}\right\|_{2}. $$
ii. There exists a positive constant μ such that
$$f_{i}(\mathbf{y})\geq f_{i}(\mathbf{x})+\nabla f_{i}(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\mu}{2}\left\|\mathbf{x}-\mathbf{y}\right\|_{2}^{2}. $$
Clearly, the Lipschitz continuity and strong convexity constants for the global objective function, \(F=\tfrac {1}{n}{\sum }_{i=1}^{n}f_{i}\), are l and μ, respectively.
Assumption 5
Each agent in the network has and knows its unique identifier, e.g., 1,⋯,n.
If this were not true, the agents may implement a finite-time distributed algorithm to assign such identifiers, e.g., with the help of task allocation algorithms [50, 51], where the task at each agent is to pick a unique number from the set {1,…,n}.
Assumption 6
Each agent knows its out-degree in the network, i.e., the number of its out-neighbors.
We note here that Assumptions 3 and 4 do not hold together; when applicable, the algorithms we discuss use either one of these assumptions but not both. We will discuss FROST, the algorithm proposed in this paper, under Assumptions 2, 4, 5.
Related work
In this section, we discuss related distributed first-order methods and provide an intuitive explanation for each one of them.
Algorithms using doubly stochastic weights
A well-known solution to distributed optimization over undirected graphs is Distributed Gradient Descent (DGD) [12, 13], which combines distributed averaging with a local gradient step. Each agent i maintains a local estimate, \(\mathbf {x}_{k}^{i}\), of the optimal solution, x^{∗}, and implements the following iteration:
where W={w_{ij}} is doubly stochastic and respects the graph topology. The stepsize α_{k} is diminishing such that \({\sum }_{k=0}^{\infty }\alpha _{k}=\infty \) and \({\sum }_{k=0}^{\infty }\alpha _{k}^{2}<\infty \). Under Assumptions 1, 3, and 6, DGD converges to x^{∗} at the rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\). The convergence rate is slow because of the diminishing stepsize. If a constant stepsize is used in DGD, i.e., α_{k}=α, it converges faster to an error ball, proportional to α, around x^{∗} [14, 15]. This is because x^{∗} is not a fixed point of the above iteration when the stepsize is a constant.
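The DGD iteration described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard DGD form \(\mathbf{x}_{k+1}^{i}=\sum_{j}w_{ij}\mathbf{x}_{k}^{j}-\alpha_{k}\nabla f_{i}(\mathbf{x}_{k}^{i})\), hypothetical scalar costs f_i(x) = (x − b_i)^2/2 whose average is minimized at the mean of the b_i, a hand-picked doubly stochastic matrix for a three-agent undirected path, and the diminishing stepsize α_k = 1/(k+1). The names `dgd`, `W`, and `b` are ours.

```python
def dgd(W, b, num_iters):
    """Run DGD on f_i(x) = 0.5 * (x - b[i])**2 with alpha_k = 1 / (k + 1)."""
    n = len(b)
    x = [0.0] * n  # arbitrary initial estimates, one scalar per agent
    for k in range(num_iters):
        alpha_k = 1.0 / (k + 1)  # sum of alpha_k diverges, sum of squares converges
        # Consensus step: each agent averages the estimates of its neighbors.
        mixed = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Local gradient step: the gradient of f_i at x[i] is x[i] - b[i].
        x = [mixed[i] - alpha_k * (x[i] - b[i]) for i in range(n)]
    return x


# Assumed doubly stochastic weights for the undirected path 0 - 1 - 2.
W = [[2 / 3, 1 / 3, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 3, 2 / 3]]
b = [1.0, 2.0, 6.0]  # the minimizer of the average cost is mean(b) = 3
x_dgd = dgd(W, b, 5000)
```

In this toy setting, the estimates approach the optimum only as fast as the stepsize decays, which is the slow behavior that motivates the constant-stepsize methods discussed next.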
To accelerate the convergence, Refs. [26, 28] recently propose a distributed first-order method based on gradient tracking, which uses a constant stepsize and replaces the local gradient, at each agent in DGD, with an asymptotic estimator of the global gradient^{Footnote 1}. The algorithm is updated as follows [26, 28]:
initialized with \(\mathbf {y}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\) and an arbitrary \(\mathbf {x}_{0}^{i}\) at each agent. The first equation is essentially a descent method, after mixing with neighboring information, where the descent direction is \(\mathbf {y}_{k}^{i}\), instead of \(\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\) as was in Eq. (1). The second equation is a global gradient estimator when viewed as dynamic consensus [52], i.e., \(\mathbf {y}_{k}^{i}\) asymptotically tracks the average of local gradients: \(\frac {1}{n}{\sum }_{i=1}^{n}\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\). It is shown in Refs. [28, 40, 46] that \(\mathbf {x}_{k}^{i}\) converges linearly to x^{∗} under Assumptions 1, 4, and 6, with a sufficiently small stepsize, α. Note that these methods, Eqs. (1) and (2a)–(2b), are not applicable to directed graphs as they require doubly stochastic weights.
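A minimal sketch of gradient tracking with doubly stochastic weights follows. It assumes the commonly stated form of the update, x_{k+1}^i = Σ_j w_ij x_k^j − α y_k^i and y_{k+1}^i = Σ_j w_ij y_k^j + ∇f_i(x_{k+1}^i) − ∇f_i(x_k^i), together with hypothetical scalar costs f_i(x) = (x − b_i)^2/2, an assumed weight matrix, and an assumed stepsize α = 0.1. The function and variable names are illustrative.

```python
def gradient_tracking(W, b, alpha, num_iters):
    """Gradient tracking for f_i(x) = 0.5 * (x - b[i])**2 with constant stepsize."""
    n = len(b)

    def grad(i, xi):
        return xi - b[i]  # local gradient of f_i

    x = [0.0] * n                           # arbitrary initial estimates
    y = [grad(i, x[i]) for i in range(n)]   # y_0^i = grad f_i(x_0^i)
    for _ in range(num_iters):
        # Descent along the tracked direction y after mixing with neighbors.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i]
                 for i in range(n)]
        # Dynamic consensus on the gradients: mix, then add the local change.
        y = [sum(W[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x


# Assumed doubly stochastic weights for the undirected path 0 - 1 - 2.
W_gt = [[2 / 3, 1 / 3, 0.0],
        [1 / 3, 1 / 3, 1 / 3],
        [0.0, 1 / 3, 2 / 3]]
b_gt = [1.0, 2.0, 6.0]  # optimum of the average cost is mean(b_gt) = 3
x_gt = gradient_tracking(W_gt, b_gt, alpha=0.1, num_iters=500)
```

With a constant stepsize, the iterates in this example reach the exact minimizer rather than an error ball, in line with the linear convergence reported for these methods.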
Algorithms using column-stochastic weights
We first consider the case when DGD in Eq. (1) is applied to a directed graph and the weight matrix is column-stochastic but not row-stochastic. It can be obtained that [32]:
where \(\overline {\mathbf {x}}_{k}=\frac {1}{n}{\sum }_{i=1}^{n}\mathbf {x}_{k}^{i}\). From Eq. (3), it is clear that the average of the estimates, \(\overline {\mathbf {x}}_{k}\), converges to x^{∗}, as Eq. (3) can be viewed as a centralized gradient method if each local estimate \(\mathbf {x}_{k}^{i}\) converges to \(\overline {\mathbf {x}}_{k}\). However, since the weight matrix is not row-stochastic, the estimates of agents will not reach an agreement [32]. This discussion motivates combining DGD with an algorithm, called push-sum, briefly discussed next, that enables agreement over directed graphs with column-stochastic weights.
Push-sum consensus
Push-sum [38, 53] is a technique to achieve average consensus over arbitrary digraphs. At time k, each agent maintains two state vectors, \(\mathbf {x}_{k}^{i}\), \(\mathbf {z}_{k}^{i}\in \mathbb {R}^{p}\), and an auxiliary scalar variable, \(v_{k}^{i}\), initialized with \(v_{0}^{i}=1\). Push-sum performs the following iterations:
where \(\underline {B}=\left \{b_{ij}\right \}\) is column-stochastic. Equation (4a) can be viewed as an independent algorithm to asymptotically learn the right Perron eigenvector of \(\underline {B}\); recall that the right Perron eigenvector of \(\underline {B}\) is not 1_{n} because \(\underline {B}\) is not row-stochastic and we denote it by π_{c}. In fact, it can be verified that \({\lim }_{k\rightarrow \infty }v_{k}^{i}=n[\boldsymbol {\pi }_{c}]_{i}\) and that \({\lim }_{k\rightarrow \infty }\mathbf {x}_{k}^{i}=[\boldsymbol {\pi }_{c}]_{i}\sum _{j=1}^{n}\mathbf {x}_{0}^{j}\). Therefore, the limit of \(\mathbf {z}_{k}^{i}\), as the ratio of \(\mathbf {x}_{k}^{i}\) over \(v_{k}^{i}\), is the average of the initial values:
$$\lim_{k\rightarrow\infty}\mathbf{z}_{k}^{i}=\frac{[\boldsymbol{\pi}_{c}]_{i}\sum_{j=1}^{n}\mathbf{x}_{0}^{j}}{n[\boldsymbol{\pi}_{c}]_{i}}=\frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_{0}^{j}. $$
In the next subsection, we present subgradient-push that applies push-sum to DGD, see [32, 33] for an alternate approach that does not require eigenvector estimation of Eq. (4a).
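To make the push-sum mechanism concrete, the following sketch runs it on a small digraph. It assumes the usual form of the iterations (both v and x are mixed with the same column-stochastic matrix, and z is their ratio), scalar states, and weights b_ij = 1/|N_j^out| built from each sender's out-degree. The four-agent graph and the name `push_sum` are hypothetical.

```python
def push_sum(out_neighbors, x0, num_iters):
    """Push-sum average consensus with column-stochastic weights."""
    n = len(x0)
    # Agent j splits its mass equally among its out-neighbors (itself included),
    # so every column of B sums to one. This needs the out-degree of each agent.
    B = [[1.0 / len(out_neighbors[j]) if i in out_neighbors[j] else 0.0
          for j in range(n)] for i in range(n)]
    x = list(x0)
    v = [1.0] * n  # auxiliary scalars, v_0^i = 1
    for _ in range(num_iters):
        x = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        v = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    # The ratio cancels the imbalance of the right Perron eigenvector of B.
    return [x[i] / v[i] for i in range(n)]


# Assumed strongly connected digraph: out_neighbors[j] lists where j sends.
out_neighbors = [{0, 1, 2}, {1, 2}, {2, 3}, {3, 0}]
z_push = push_sum(out_neighbors, [1.0, 2.0, 6.0, 3.0], 200)
```

Neither x nor v reaches agreement on its own in this example; only their ratio converges to the average of the initial values.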
Subgradient-push
To solve Problem P1 over arbitrary directed graphs, Refs. [30, 31] develop subgradient-push with the following iterations:
initialized with \(v_{0}^{i}=1\) and an arbitrary \(\mathbf {x}_{0}^{i}\) at each agent. The stepsize, α_{k}, satisfies the same conditions as in DGD. To understand these iterations, note that Eqs. (5a)–(5c) are nearly the same as Eqs. (4a)–(4c), except that there is an additional gradient term in Eq. (5b), which drives the limit of \(\mathbf {z}_{k}^{i}\) to x^{∗}. Under Assumptions 2, 3, and 6, subgradient-push converges to x^{∗} at the rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\). For extensions of subgradient-push to asynchronous networks, see recent work [34–36]. We next describe an algorithm that significantly improves this convergence rate.
ADD-OPT/Push-DIGing
ADD-OPT [39], extended to time-varying graphs in Push-DIGing [40], is a fast algorithm over directed graphs, which converges at a linear rate to x^{∗} under Assumptions 2, 4, and 6, in contrast to the sublinear convergence of subgradient-push. The three vectors, \(\mathbf {x}^{i}_{k}\), \(\mathbf {z}^{i}_{k}\), and \(\mathbf {y}^{i}_{k}\), and a scalar \(v^{i}_{k}\) maintained at each agent i, are updated as follows:
where each agent is initialized with \(v_{0}^{i}=1\), \(\mathbf {y}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\), and an arbitrary \(\mathbf {x}_{0}^{i}\). We note here that ADD-OPT/Push-DIGing essentially applies push-sum to the algorithm in Eqs. (2a)–(2b), where the doubly stochastic weights therein are replaced by column-stochastic weights.
The \(\mathcal {AB}\) algorithm
As we can see, subgradient-push and ADD-OPT/Push-DIGing, described before, have a nonlinear term that comes from the division by the eigenvector estimation. In contrast, the \(\mathcal {AB}\) algorithm, introduced in [42] and extended to \(\mathcal {AB}m\) with the addition of a heavy-ball momentum term in [43] and to time-varying graphs in [54], removes this nonlinearity and remains applicable to directed graphs by a simultaneous application of row- and column-stochastic weights^{Footnote 2}. Each agent i maintains two variables: \(\mathbf {x}_{k}^{i}\), \(\mathbf {y}_{k}^{i}\in \mathbb {R}^{p}\), where, as before, \(\mathbf {x}_{k}^{i}\) is the estimate of x^{∗} and \(\mathbf {y}_{k}^{i}\) tracks the average gradient, \(\frac {1}{n}{\sum }_{i=1}^{n}\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\). The \(\mathcal {AB}\) algorithm, initialized with \(\mathbf {y}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\) and arbitrary \(\mathbf {x}_{0}^{i}\) at each agent, performs the following iterations:
where \(\underline {A}=\{a_{ij}\}\) is row-stochastic and \(\underline {B}=\{b_{ij}\}\) is column-stochastic. It is shown that \(\mathcal {AB}\) converges linearly to x^{∗} for sufficiently small stepsizes under Assumptions 2, 4, and 6 [42]. Therefore, \(\mathcal {AB}\) can be viewed as a generalization of the algorithm in Eqs. (2a)–(2b) as the doubly stochastic weights therein are replaced by row- and column-stochastic weights. Furthermore, it is shown in [43] that ADD-OPT/Push-DIGing in Eqs. (6a)–(6d) in fact can be derived from an equivalent form of \(\mathcal {AB}\) after a state transformation on the x_{k}-update; see [43] for details. For applications of the \(\mathcal {AB}\) algorithm to distributed least squares, see, for instance, [56].
Algorithms using row-stochastic weights
All of the aforementioned methods require at least each agent to know its out-degree in the network in order to construct doubly or column-stochastic weights. This requirement may be infeasible, e.g., when agents use broadcast-based communication protocols. Row-stochastic weights, on the other hand, are easier to implement in a distributed manner as every agent locally assigns an appropriate weight to each incoming variable from its in-neighbors. In the next section, we describe the main contribution of this paper, i.e., a fast optimization algorithm that uses only row-stochastic weights and uncoordinated stepsizes.
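The local construction of row-stochastic weights can be illustrated as follows. The uniform rule a_ij = 1/|N_i^in| is only one possible choice and is our assumption, as are the four-agent digraph and the helper name `row_stochastic_weights`. Each agent only needs to count the messages it receives.

```python
def row_stochastic_weights(in_neighbors):
    """Each agent i weights every incoming value (its own included) equally."""
    n = len(in_neighbors)
    A = [[0.0] * n for _ in range(n)]
    for i, senders in enumerate(in_neighbors):
        for j in senders:
            A[i][j] = 1.0 / len(senders)  # depends only on what agent i receives
    return A


# Assumed digraph: in_neighbors[i] lists the agents that agent i hears from.
in_neighbors = [{0, 3}, {1, 0}, {2, 1, 0}, {3, 2}]
A_row = row_stochastic_weights(in_neighbors)
row_sums = [sum(row) for row in A_row]
col_sums = [sum(A_row[i][j] for i in range(4)) for j in range(4)]
```

Here every row sums to one by construction, while the columns do not (the first column sums to 4/3), so no agent needs to know how many agents listen to its broadcasts.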
To motivate the proposed algorithm, we first consider DGD in Eq. (1) over directed graphs when the weight matrix in DGD is chosen to be row-stochastic, but not column-stochastic. From consensus arguments and the fact that the stepsize α_{k} goes to 0, it can be verified that the agents achieve agreement. However, this agreement is not on the optimal solution. This can be shown [32] by defining an accumulation state, \(\widehat {\mathbf {x}}_{k}={\sum }_{i=1}^{n}[\!\boldsymbol {\pi }_{r}]_{i}\mathbf {x}_{k}^{i}\), where π_{r} is the left Perron eigenvector of the row-stochastic weight matrix, to obtain
It can be verified that the agents agree to the limit of the above iteration, which is suboptimal since this iteration minimizes a weighted sum of the objective functions and not the sum. This argument leads to a modification of Eq. (8) that cancels the imbalance in the gradient term caused by the fact that π_{r} is not a vector of all 1’s, a consequence of losing the column-stochasticity in the weight matrix. The modification, introduced in [44], is implemented as follows:
where \(\underline {A}=\{a_{ij}\}\) is row-stochastic and the algorithm is initialized with \(\mathbf {y}_{0}^{i}=\mathbf {e}_{i}\) and an arbitrary \(\mathbf {x}_{0}^{i}\) at each agent. Equation (9a) asymptotically learns the left Perron eigenvector of the row-stochastic weight matrix \(\underline {A}\), i.e., \({\lim }_{k\rightarrow \infty }\mathbf {y}_{k}^{i}=\boldsymbol {\pi }_{r},\forall i\). The above algorithm achieves a sublinear convergence rate of \(\mathcal {O}\left (\frac {\log k}{\sqrt {k}}\right)\) under Assumptions 2, 3, and 5, see [44] for details.
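The eigenvector-learning step can be checked numerically. The sketch below assumes the recursion y_{k+1}^i = Σ_j a_ij y_k^j with y_0^i = e_i, so that the stacked vectors equal the kth power of the weight matrix. The example matrix is a hypothetical row-stochastic matrix of a four-agent digraph, and its left Perron eigenvector (4/13, 2/13, 3/13, 4/13) was obtained by hand.

```python
def learn_left_perron(A, num_iters):
    """Iterate y_{k+1}^i = sum_j a_ij y_k^j from y_0^i = e_i (row i of Y is y^i)."""
    n = len(A)
    Y = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(num_iters):
        Y = [[sum(A[i][j] * Y[j][m] for j in range(n)) for m in range(n)]
             for i in range(n)]
    return Y


# Assumed row-stochastic weights (uniform over in-neighbors, self included).
A_perron = [[1 / 2, 0.0, 0.0, 1 / 2],
            [1 / 2, 1 / 2, 0.0, 0.0],
            [1 / 3, 1 / 3, 1 / 3, 0.0],
            [0.0, 0.0, 1 / 2, 1 / 2]]
Y_limit = learn_left_perron(A_perron, 200)
pi_r = [4 / 13, 2 / 13, 3 / 13, 4 / 13]  # satisfies pi_r^T A = pi_r^T, sums to 1
```

Every row of `Y_limit` approaches `pi_r`, so agent i can read off its own entry [π_r]_i from the ith component of its local vector. This is why unique identifiers are needed.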
FROST (Fast Row-stochastic Optimization with uncoordinated STepsizes)
Based on the insights that gradient tracking and constant stepsizes provide exact and fast linear convergence, we now describe FROST that adds gradient tracking to the algorithm in Eqs. (9a)–(9b) while using constant but uncoordinated stepsizes at the agents. Each agent i at the kth iteration maintains three variables, \(\mathbf {x}_{k}^{i},\mathbf {z}_{k}^{i}\in \mathbb {R}^{p}\), and \(\mathbf {y}_{k}^{i}\in \mathbb {R}^{n}\). At the (k+1)th iteration, agent i performs the following update:
where α_{i}’s are the uncoordinated stepsizes locally chosen at each agent and the row-stochastic weights, \(\underline {A}=\left \{a_{ij}\right \}\), respect the graph topology such that:
The algorithm is initialized with an arbitrary \(\mathbf {x}_{0}^{i}\), \(\mathbf {y}_{0}^{i}=\mathbf {e}_{i}\), and \(\mathbf {z}_{0}^{i}=\nabla f_{i}\left (\mathbf {x}_{0}^{i}\right)\). We point out that the initial condition for Eq. (10a) and the divisions in Eq. (10c) require each agent to have a unique identifier. Clearly, Assumption 5 is applicable here. Note that Eq. (10c) is a modified gradient tracking update, first applied to optimization with row-stochastic weights in [45], where the divisions are used to eliminate the imbalance caused by the left Perron eigenvector of the (row-stochastic) weight matrix \(\underline {A}\). We note that the algorithm in [45] requires identical stepsizes at the agents and thus is a special case of Eqs. (10a)–(10c).
For analysis purposes, we write Eqs. (10a)–(10c) in a compact vector-matrix form. To this aim, we introduce some notation as follows: let x_{k}, z_{k}, and ∇f(x_{k}) collect the local variables \(\mathbf {x}_{k}^{i}\), \(\mathbf {z}_{k}^{i}\), and \(\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\) in a vector in \(\mathbb {R}^{np}\), respectively, and define
Since the weight matrix \(\underline {A}\) is primitive with positive diagonals, it is straightforward to verify that \(\widetilde {Y}_{k}\) is invertible for any k. Based on the notation above, Eqs. (10a)–(10c) can be written compactly as follows:
where \(\underline {Y}_{0}=I_{n}\), z_{0}=∇f_{0}, and x_{0} is arbitrary. We emphasize that the implementation of FROST needs no knowledge of any agent’s out-degree anywhere in the network in contrast to the earlier related work in [30–33, 37, 39, 40, 42, 43]. Note that Refs. [22, 23] also use row-stochastic weights but require an additional locally balanced assumption and are only applicable to undirected graphs.
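The following sketch illustrates how the three FROST variables interact. It is not the authors' implementation. It assumes the update takes the form y_{k+1}^i = Σ_j a_ij y_k^j, x_{k+1}^i = Σ_j a_ij x_k^j − α_i z_k^i, and z_{k+1}^i = Σ_j a_ij z_k^j + ∇f_i(x_{k+1}^i)/[y_{k+1}^i]_i − ∇f_i(x_k^i)/[y_k^i]_i, which is how we read the description of Eqs. (10a)–(10c). The scalar costs f_i(x) = (x − b_i)^2/2, the digraph, the uniform in-neighbor weights, and the stepsize values are all hypothetical.

```python
def frost(in_neighbors, b, alphas, num_iters):
    """Sketch of FROST for scalar costs f_i(x) = 0.5 * (x - b[i])**2."""
    n = len(b)
    # Row-stochastic weights assigned locally: uniform over in-neighbors.
    A = [[1.0 / len(in_neighbors[i]) if j in in_neighbors[i] else 0.0
          for j in range(n)] for i in range(n)]

    def grad(i, xi):
        return xi - b[i]  # local gradient of f_i

    x = [0.0] * n                                      # arbitrary x_0^i
    y = [[1.0 if i == j else 0.0 for j in range(n)]    # y_0^i = e_i
         for i in range(n)]
    z = [grad(i, x[i]) for i in range(n)]              # z_0^i = grad f_i(x_0^i)
    for _ in range(num_iters):
        # Eigenvector learning: row i of y is agent i's estimate of pi_r.
        y_new = [[sum(A[i][j] * y[j][m] for j in range(n)) for m in range(n)]
                 for i in range(n)]
        # Descent with the agent's own stepsize alphas[i].
        x_new = [sum(A[i][j] * x[j] for j in range(n)) - alphas[i] * z[i]
                 for i in range(n)]
        # Gradient tracking, rescaled by the agent's own eigenvector entry.
        z = [sum(A[i][j] * z[j] for j in range(n))
             + grad(i, x_new[i]) / y_new[i][i] - grad(i, x[i]) / y[i][i]
             for i in range(n)]
        x, y = x_new, y_new
    return x


in_neighbors_frost = [{0, 3}, {1, 0}, {2, 1, 0}, {3, 2}]  # assumed digraph
b_frost = [1.0, 2.0, 6.0, 3.0]            # optimum is mean(b_frost) = 3
alphas_frost = [0.005, 0.002, 0.001, 0.004]  # uncoordinated, chosen locally
x_frost = frost(in_neighbors_frost, b_frost, alphas_frost, 4000)
```

The positive diagonal weights keep every divisor [y_k^i]_i strictly positive. In this toy run, all agents converge to the same minimizer even though no two stepsizes agree and no out-degree is used.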
Convergence analysis
In this section, we present the convergence analysis of FROST described in Eqs. (11a)–(11c). We first define a few additional variables as follows:
Since \(\underline {A}\) is primitive and row-stochastic, from the Perron-Frobenius theorem [49], we note that \(Y_{\infty } = \left (\mathbf {1}_{n}\boldsymbol {\pi }_{r}^{\top }\right)\otimes I_{p}\), where \(\boldsymbol {\pi }_{r}^{\top }\) is the left Perron eigenvector of \(\underline {A}\).
Auxiliary relations
We now start the convergence analysis with a key lemma regarding the contraction of the augmented weight matrix A under an arbitrary norm.
Lemma 1
Let Assumption 2 hold and consider the augmented weight matrix \(A=\underline {A}\otimes I_{p}\). There exists a vector norm, ∥·∥, such that \(\forall \mathbf {a}\in \mathbb {R}^{np}\),
where 0<σ<1 is some constant.
Proof
It can be verified that AY_{∞}=Y_{∞} and Y_{∞}Y_{∞}=Y_{∞}, which leads to the following relation:
Next, from the Perron-Frobenius theorem, we note that [49]
thus, there exists a matrix norm, ∥·∥, with ∥A−Y_{∞}∥<1 and a compatible vector norm, ∥·∥, see Chapter 5 in [49], such that
and the lemma follows with σ=∥A−Y_{∞}∥. □
As shown above, the existence of a norm in which the consensus process with row-stochastic matrix \(\underline {A}\) is a contraction does not follow the standard 2-norm argument for doubly stochastic matrices [28, 40]. The ensuing arguments built on this notion of contraction under arbitrary norms were first introduced in [39] for column-stochastic weights and in [45] for row-stochastic weights; these arguments are harmonized later to hold simultaneously for both row- and column-stochastic weights in [42, 43]. The next lemma, a direct consequence of the contraction introduced in Lemma 1, is a standard result from consensus and Markov chain theory [57].
Lemma 2
Consider Y_{k}, generated from the weight matrix \(\underline {A}\). We have:
where r is some positive constant and σ is the contraction factor defined in Lemma 1.
Proof
Note that \(Y_{k} = \underline {A}^{k}\otimes I_{p}=A^{k}\) from Eq. (11a), and
From Lemma 1, we have
The proof follows from the fact that all matrix norms are equivalent. □
As a consequence of Lemma 2, we next establish the linear convergence of the sequences \(\left \{ \widetilde {Y}_{k}^{-1}\right \}\) and \(\left \{ \widetilde {Y}_{k+1}^{-1}-\widetilde {Y}_{k}^{-1}\right \}\).
Lemma 3
The following inequalities hold \(\forall k\):
\((a)\ \left \|\widetilde {Y}_{k}^{-1}-\widetilde {Y}_{\infty }^{-1}\right \|_{2}\leq \sqrt {n}r\widetilde {y}^{2}\sigma ^{k}\);
\((b)\ \left \|\widetilde {Y}_{k+1}^{-1}-\widetilde {Y}_{k}^{-1}\right \|_{2}\leq 2\sqrt {n}r\widetilde {y}^{2}\sigma ^{k}\).
Proof
The proof of (a) is as follows:
where the last inequality uses Lemma 2 and the fact that \(\|X\|_{F}\leq \sqrt {n}\|X\|_{2},\ \forall X\in \mathbb {R}^{n\times n}\). The result in (b) is straightforward by applying (a), i.e.,
which completes the proof. □
The next lemma presents the dynamics that govern the evolution of the weighted sum of z_{k}; recall that z_{k}, in Eq. (11c), asymptotically tracks the average of local gradients, \(\frac {1}{n}\sum _{i=1}^{n}\nabla f_{i}\left (\mathbf {x}_{k}^{i}\right)\).
Lemma 4
The following equation holds for all k:
Proof
Recall that Y_{∞}A=Y_{∞}. We obtain from Eq. (11c) that
Doing this iteratively, we have
With the initial conditions that z_{0}=∇f(x_{0}) and \(\widetilde {Y}_{0}=I_{np}\), we complete the proof. □
The next lemma, a standard result in convex optimization theory from [58], states that the distance to the optimal solution contracts in each step in the centralized gradient method.
Lemma 5
Let μ and l be the strong convexity and Lipschitz continuity constants for the global objective function, F(x), respectively. Then \(\forall \mathbf {x}\in \mathbb {R}^{p}\) and \(0<\alpha <\frac {2}{l}\), we have
where σ_{F}=max(|1−αμ|,|1−αl|).
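As a quick numerical illustration of Lemma 5, the sketch below checks the contraction \(\left\|\mathbf{x}-\alpha\nabla F(\mathbf{x})-\mathbf{x}^{*}\right\|_{2}\leq\sigma_{F}\left\|\mathbf{x}-\mathbf{x}^{*}\right\|_{2}\), which is the standard form of this result and is assumed here. The objective is a hypothetical separable quadratic with curvatures between μ = 0.5 and l = 4, and the helper names are ours.

```python
import math


def gradient_step_ratio(curvatures, minimizer, x, alpha):
    """Ratio of distances to the minimizer after and before one gradient step
    on F(x) = 0.5 * sum_j curvatures[j] * (x[j] - minimizer[j])**2."""
    stepped = [xj - alpha * cj * (xj - mj)
               for xj, cj, mj in zip(x, curvatures, minimizer)]
    after = math.sqrt(sum((sj - mj) ** 2 for sj, mj in zip(stepped, minimizer)))
    before = math.sqrt(sum((xj - mj) ** 2 for xj, mj in zip(x, minimizer)))
    return after / before


def sigma_F(alpha, mu, l):
    """Contraction factor of Lemma 5, including the absolute values."""
    return max(abs(1.0 - alpha * mu), abs(1.0 - alpha * l))


mu_demo, l_demo = 0.5, 4.0
curvatures = [0.5, 1.0, 4.0]   # all within [mu_demo, l_demo]
minimizer = [1.0, -2.0, 0.5]
x_start = [3.0, 0.0, -1.0]
step_sizes = [0.1, 0.4, 0.49]  # all inside (0, 2 / l_demo) = (0, 0.5)
ratios = [gradient_step_ratio(curvatures, minimizer, x_start, a)
          for a in step_sizes]
bounds = [sigma_F(a, mu_demo, l_demo) for a in step_sizes]
```

For α = 0.49, the term 1 − αl is negative, so the absolute values in σ_F are what keep the bound meaningful on the whole interval (0, 2/l).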
With the help of the previous lemmas, we are ready to derive a crucial contraction relationship in the proposed algorithm.
Contraction relationship
Our strategy to show convergence is to bound ∥x_{k+1}−Y_{∞}x_{k+1}∥, ∥Y_{∞}x_{k+1}−1_{n}⊗x^{∗}∥_{2}, and ∥z_{k+1}−Y_{∞}z_{k+1}∥ as a linear function of their values in the last iteration and ∇f(x_{k}); this approach extends the work in [28] on doubly stochastic weights to row-stochastic weights. We will present this relationship in the next lemmas. Before we proceed, we note that since all vector norms are equivalent in \(\mathbb {R}^{np}\), there exist positive constants c,d such that: ∥·∥_{2}≤c∥·∥,∥·∥≤d∥·∥_{2}. First, we derive a bound for ∥x_{k+1}−Y_{∞}x_{k+1}∥, the consensus error of the agents.
Lemma 6
The following inequality holds, ∀k:
where d is the equivalence-norm constant such that ∥·∥≤d∥·∥_{2} and \(\overline {\alpha }\) is the largest stepsize among the agents.
Proof
Note that Y_{∞}A=Y_{∞}. Using Eq. (11b) and Lemma 1, we have:
which completes the proof. □
Next, we derive a bound for ∥Y_{∞}x_{k+1}−1_{n}⊗x^{∗}∥_{2}, i.e., the optimality gap between the accumulation state of the network, Y_{∞}x_{k+1}, and the optimal solution, 1_{n}⊗x^{∗}.
Lemma 7
If \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {2}{nl}\), the following inequality holds, ∀k:
where \(\lambda =\max \left (\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }\mu \right |,\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }l \right |\right)\) and c is the equivalence-norm constant such that ∥·∥_{2}≤c∥·∥.
Proof
Recalling that \(Y_{\infty }=\left (\mathbf {1}_{n} \boldsymbol {\pi }_{r}^{\top }\right)\otimes I_{p}\) and Y_{∞}A=Y_{∞}, we have the following:
Since the last term in the inequality above matches the second last term in Eq. (14), we only need to handle the first term. We further note that:
Now, we derive an upper bound for the first term in Eq. (15)
If \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {2}{nl}\), according to Lemma 5
where \(\lambda =\max \left (\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }\mu \right |,\left |1-n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }l \right |\right)\).
Next we derive a bound for s_{2}
where it is straightforward to bound s_{3} as
Since \(Y_{\infty }\widetilde {Y}_{\infty }^{-1}=\left (\mathbf {1}_{n}\mathbf {1}_{n}^{\top }\right)\otimes I_{p}\) and \(Y_{\infty }\mathbf {z}_{k}=Y_{\infty }\widetilde {Y}_{k}^{-1}\nabla \mathbf {f}(\mathbf {x}_{k})\) from Lemma 4, we have:
where we use Lemma 3. Combining Eqs. (15)–(20), we finish the proof. □
Next, we bound ∥z_{k+1}−Y_{∞}z_{k+1}∥, the error in gradient estimation.
Lemma 8
The following inequality holds, ∀k
Proof
According to Eq. (11c) and Lemma 1, we have:
Note that \(Y_{\infty }\mathbf {z}_{k}=Y_{\infty }\widetilde {Y}_{k}^{-1}\nabla \mathbf {f}(\mathbf {x}_{k})\) from Lemma 4. Therefore,
where in the last inequality, we use Lemma 3. We now bound ∥x_{k+1}−x_{k}∥_{2}.
where in the second inequality, we use the fact that (A−I_{np})Y_{∞} is a zero matrix. Combining Eqs. (21)–(23), we obtain the desired result. □
The last step is to bound ∥z_{k}∥_{2} in terms of ∥x_{k}−Y_{∞}x_{k}∥, ∥Y_{∞}x_{k}−1_{n}⊗x^{∗}∥_{2}, and ∥z_{k}−Y_{∞}z_{k}∥. Then, we can replace ∥z_{k}∥_{2} in Lemmas 6 and 8 by this bound in order to develop an LTI system inequality.
Lemma 9
The following inequality holds, ∀k:
Proof
Recall that \(Y_{\infty }\widetilde {Y}_{\infty }^{-1}=(\mathbf {1}_{n}\otimes I_{p})\left (\mathbf {1}_{n}^{\top }\otimes I_{p}\right)\) and \(Y_{\infty }\mathbf {z}_{k}=Y_{\infty }\widetilde {Y}_{k}^{-1}\nabla \mathbf {f}(\mathbf {x}_{k})\) from Lemma 4. We have the following:
where in the second inequality, we use the fact that \(\left (\mathbf {1}_{n}^{\top }\otimes I_{p}\right)\nabla \mathbf {f}(\mathbf {x}^{*})=0\), which is the optimality condition for Problem P1. □
Before the main result, we present an additional lemma from nonnegative matrix theory that will be helpful in establishing the linear convergence of FROST.
Lemma 10
(Theorem 8.1.29 in [49]) Let \(X\in \mathbb {R}^{n\times n}\) be a nonnegative matrix and \(\mathbf {x}\in \mathbb {R}^{n}\) be a positive vector. If Xx<ωx, then ρ(X)<ω.
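As a quick numerical illustration of Lemma 10, the sketch below checks the entrywise condition Xx<ωx for a small nonnegative matrix and compares ω with the spectral radius. The matrix, the vector, and ω are hypothetical values chosen only for this example.

```python
import numpy as np

def spectral_radius(X):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(X))))

def certifies_bound(X, x, omega):
    """True when X >= 0, x > 0, and X x < omega x hold entrywise."""
    return bool(np.all(X >= 0) and np.all(x > 0) and np.all(X @ x < omega * x))

# Hypothetical data: a nonnegative matrix and a positive test vector.
X = np.array([[0.5, 0.2, 0.1],
              [0.1, 0.6, 0.1],
              [0.3, 0.0, 0.4]])
x = np.array([1.0, 1.0, 1.0])
omega = 0.9

print(certifies_bound(X, x, omega), spectral_radius(X) < omega)
```

The lemma only gives a sufficient condition: a failed check for one particular x says nothing about ρ(X), which is why the proof of Theorem 1 searches for a suitable positive vector.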
Main results
With the help of the auxiliary relationships developed in the previous subsection, we now present the main results as follows in Theorems 1 and 2. Theorem 1 states that the relationships derived in the previous subsection indeed provide a contraction when the largest stepsize, \(\overline {\alpha }\), is sufficiently small. Theorem 2 then establishes the linear convergence of FROST.
Theorem 1
If \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {2}{nl}\), the following LTI system inequality holds:
where \(\mathbf {t}_{k},\mathbf {s}_{k}\in \mathbb {R}^{3}\), and \(J_{\boldsymbol {\alpha }},H_{k}\in \mathbb {R}^{3\times 3}\) are defined as follows:
and the constants a_{i}’s are
Let [ π_{r}]_{−} be the smallest element in π_{r}. When the largest stepsize, \(\overline {\alpha }\), satisfies
with positive constants δ_{1},δ_{2},δ_{3} such that
then the spectral radius of J_{α} is strictly less than 1.
Proof
Combining Lemmas 6–9, one can verify that Eq. (26) holds if \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {2}{nl}\). Recall that \(\lambda =\max \left (\left |1-\mu n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }\right |,\left |1-l n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha } \right |\right)\). When \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {1}{nl}\), \(\lambda =1-\mu n\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }\), since μ≤l [59]. To ensure \(\boldsymbol {\pi }_{r}^{\top }\boldsymbol {\alpha }<\frac {1}{nl}\), it suffices to require \(\overline {\alpha }<\frac {1}{nl}\), since the entries of π_{r} are nonnegative and sum to one. The next step is to find an upper bound, \(\hat {\alpha }\), on the largest stepsize such that ρ(J_{α})<1 when \(\overline {\alpha }<\hat {\alpha }.\) In light of Lemma 10, we solve for the range of the largest stepsize, \(\overline {\alpha }\), and a positive vector δ=[δ_{1},δ_{2},δ_{3}]^{⊤} from the following:
which is equivalent to the following set of inequalities:
Since the right hand side of the third inequality in Eq. (30) has to be positive, we have that:
In order to find the range of δ_{2} such that the second inequality holds, it suffices to solve for the range of δ_{2} such that the following inequality holds:
where [π_{r}]_{−} is the smallest entry in π_{r}. Therefore, as long as
the second inequality in Eq. (30) holds. The next step is to solve for the range of \(\overline {\alpha }\) from the first and third inequalities in Eq. (30). We get
where the ranges of δ_{1} and δ_{2} are given in Eqs. (31) and (32), respectively, and δ_{3} is an arbitrary positive constant. The theorem follows. □
Note that δ_{1},δ_{2},δ_{3} are essentially adjustable parameters that are chosen independently of the stepsizes. Specifically, according to Eq. (28), we first choose an arbitrary positive constant δ_{3}, then choose a constant δ_{1} such that \(0< \delta _{1} < \frac {(1-\sigma)\delta _{3}}{a_{6}}\), and finally choose a constant δ_{2} such that \(\delta _{2}>\frac {a_{4}\delta _{1}+a_{5}\delta _{3}}{\mu n[\boldsymbol {\pi }_{r}]_{-}}\).
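The selection order just described can be written as a short routine. This is a minimal sketch under stated assumptions: the numerical values of a_{4}, a_{5}, a_{6}, σ, μ, n, and [π_{r}]_{−} below are hypothetical placeholders (the actual constants come from Theorem 1), and the `margin` parameter is an illustrative device for staying strictly inside each admissible range.

```python
def choose_deltas(a4, a5, a6, sigma, mu, n, pi_min, delta3=1.0, margin=0.5):
    """Pick (delta1, delta2, delta3) in the order delta3 -> delta1 -> delta2.

    margin in (0, 1) controls how far inside the admissible ranges we stay.
    """
    if not (0.0 <= sigma < 1.0 and a6 > 0 and mu > 0 and pi_min > 0):
        raise ValueError("constants outside the range assumed in this sketch")
    if not (delta3 > 0 and 0.0 < margin < 1.0):
        raise ValueError("delta3 must be positive and margin must lie in (0, 1)")
    upper1 = (1.0 - sigma) * delta3 / a6      # delta1 must lie in (0, upper1)
    delta1 = margin * upper1
    lower2 = (a4 * delta1 + a5 * delta3) / (mu * n * pi_min)
    delta2 = lower2 + margin                  # any value strictly above lower2
    return delta1, delta2, delta3

# Hypothetical constants, for illustration only.
d1, d2, d3 = choose_deltas(a4=2.0, a5=3.0, a6=4.0, sigma=0.7,
                           mu=1.0, n=4, pi_min=0.2)
print(d1, d2, d3)
```

Because none of these quantities involve the stepsizes, the routine can be run once per network and objective class, before any stepsize is selected.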
Theorem 2
If the largest stepsize \(\overline {\alpha }\) follows the bound in Eq. (27), we have:
where ξ is an arbitrarily small constant, σ is the contraction factor defined in Lemma 1, and m is some positive constant.
Noticing that ρ(J_{α})<1 when the largest stepsize, \(\overline {\alpha }\), follows the bound in Eq. (27) and that H_{k} linearly decays at the rate of σ^{k}, one can intuitively verify Theorem 2. A rigorous proof follows from [45].
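The intuition can be illustrated with a toy recursion. The sketch below is not the system of Theorem 1: the matrix `J`, the rate `sigma`, and the vector `h` are hypothetical, and the inequality is replaced by the equality t_{k+1}=Jt_{k}+σ^{k}h, which is the worst case when J is nonnegative. It computes a constant m such that the norm of t_{k} stays below m(max(ρ(J),σ)+ξ)^{k} over the simulated horizon.

```python
import numpy as np

# Hypothetical nonnegative system matrix with spectral radius below 1.
J = np.array([[0.60, 0.10, 0.10],
              [0.05, 0.90, 0.00],
              [0.20, 0.10, 0.60]])
sigma = 0.7          # assumed geometric decay rate of the perturbation
xi = 0.01            # small slack added to the rate
rho = float(np.max(np.abs(np.linalg.eigvals(J))))
rate = max(rho, sigma) + xi

t = np.ones(3)       # arbitrary initial error vector
h = np.ones(3)       # arbitrary perturbation direction
norms = []
for k in range(400):
    norms.append(np.linalg.norm(t))
    t = J @ t + sigma**k * h      # perturbed linear recursion
norms = np.array(norms)

# Smallest m for which norms[k] <= m * rate**k holds on this horizon.
m = float(np.max(norms / rate ** np.arange(norms.size)))
print(rho, rate, m, norms[-1])
```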
In Theorems 1 and 2, we establish the linear convergence of FROST when the largest stepsize, \(\overline {\alpha }\), follows the upper bound defined in Eq. (27). Distributed optimization (based on gradient tracking) with uncoordinated stepsizes has been previously studied in [26, 46, 47], over undirected graphs with doubly stochastic weights, and in [48], over directed graphs with column-stochastic weights. These works rely on some notion of heterogeneity of the stepsizes, defined respectively as the relative deviation of the stepsizes from their average, \(\frac {\|(I_{n}-U)\boldsymbol {\alpha }\|_{2}}{\|U\boldsymbol {\alpha }\|_{2}}\), where \(U=\mathbf {1}_{n}\mathbf {1}_{n}^{\top }/n\), in [26, 46], and as the ratio of the largest to the smallest stepsize, \(\frac {\max _{i}\{\alpha _{i}\}}{\min _{i}\{\alpha _{i}\}}\), in [47, 48]. The authors then show that when the heterogeneity is small enough and the largest stepsize follows a bound that is a function of the heterogeneity, the proposed algorithms converge to the optimal solution. It is worth noting that sufficiently small stepsizes do not guarantee sufficiently small heterogeneity under either of these definitions. In contrast, the upper bound on the largest stepsize in this paper, Eqs. (27) and (28), is independent of any notion of heterogeneity and only depends on the objective functions and the network parameters (see Note 3). Each agent therefore locally picks a sufficiently small stepsize independently of the other stepsizes. Moreover, this bound allows agents to choose a zero stepsize as long as at least one stepsize is positive and the largest stepsize is sufficiently small.
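To make the remark about heterogeneity concrete, the sketch below evaluates both measures for a hypothetical stepsize vector and for the same vector scaled down by a factor of 1000. Both measures are invariant to this scaling, so shrinking every stepsize does not reduce them.

```python
import numpy as np

def relative_deviation(alpha):
    """||(I - U) alpha||_2 / ||U alpha||_2 with U = 1 1^T / n."""
    n = alpha.size
    U = np.ones((n, n)) / n
    return float(np.linalg.norm((np.eye(n) - U) @ alpha)
                 / np.linalg.norm(U @ alpha))

def max_min_ratio(alpha):
    """Ratio of the largest to the smallest stepsize (all entries positive)."""
    return float(alpha.max() / alpha.min())

alpha = np.array([0.07, 0.01, 0.03, 0.05])   # hypothetical stepsizes
scaled = 1e-3 * alpha                        # much smaller, same proportions

print(relative_deviation(alpha), relative_deviation(scaled))
print(max_min_ratio(alpha), max_min_ratio(scaled))
```

The ratio-based measure is also undefined when some agent uses a zero stepsize, which is exactly the case the bound in this paper permits.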
Numerical results
In this section, we use numerical experiments to support the theoretical results. We consider a distributed logistic regression problem. Each agent i has access to m_{i} training samples, \((\mathbf {c}_{ij},y_{ij})\in \mathbb {R}^{p}\times \{-1,+1\}\), where c_{ij} contains the p features of the jth training sample at agent i and y_{ij} is the corresponding binary label. The network of agents cooperatively solves the following distributed logistic regression problem:
with each private loss function being
where \(\frac {\lambda }{2}\|\mathbf {w}\|_{2}^{2}\) is a regularization term used to prevent overfitting of the data. The feature vectors, c_{ij}’s, are randomly generated from a zero-mean Gaussian distribution. The binary labels are randomly generated from a Bernoulli distribution. The network topology is shown in Fig. 1. We adopt a simple uniform weighting strategy to construct the row- and column-stochastic weights when needed: \(a_{ij}=1/|\mathcal {N}_{i}^{{ \text {in}}}|,~b_{ij}=1/|\mathcal {N}_{j}^{{ \text {out}}}|,~\forall i,j\). We plot the average of residuals at each agent, \(\frac {1}{n}\sum _{i=1}^{n}\|\mathbf {x}_{i}(k)-\mathbf {x}^{*}\|_{2}\). In Fig. 2 (left), each curve represents the linear convergence of FROST when the corresponding agent uses a positive stepsize, optimized manually, while every other agent uses zero stepsize.
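The ingredients of this setup can be sketched as follows. The exact form of the private loss is an assumption here: the sketch uses the standard regularized logistic loss, the sum over j of ln(1+exp(−y_{ij}c_{ij}^{⊤}w)) plus (λ/2)∥w∥_{2}^{2}, and it assumes that each in-neighborhood includes the agent itself. The function names and the random data are hypothetical.

```python
import numpy as np

def local_loss(w, C, y, lam):
    """Assumed private loss: logistic loss over local samples plus L2 term."""
    margins = y * (C @ w)
    return float(np.sum(np.logaddexp(0.0, -margins)) + 0.5 * lam * (w @ w))

def local_grad(w, C, y, lam):
    """Gradient of local_loss with respect to w."""
    margins = y * (C @ w)
    # 1 / (1 + exp(m)) written with tanh for numerical stability.
    coeff = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return C.T @ coeff + lam * w

def uniform_row_stochastic(adj):
    """adj[i, j] = 1 if agent i receives from agent j (self-loops included).

    Each agent weights its incoming information equally: a_ij = 1 / |N_i^in|.
    """
    adj = np.asarray(adj, dtype=float)
    return adj / adj.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))                 # 5 hypothetical samples, p = 3
y = rng.choice([-1.0, 1.0], size=5)         # random binary labels
w = rng.normal(size=3)

adj = np.array([[1, 0, 1, 1],
                [1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 1, 1]])
A_row = uniform_row_stochastic(adj)
print(local_loss(w, C, y, 0.1), A_row.sum(axis=1))
```

Only in-neighbor counts are needed to build `A_row`, so each agent can construct its own row without knowing its out-degree.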
In Fig. 2 (right), we compare the performance of FROST with ADD-OPT/Push-DIGing [39, 40] (see Section 3.2.3) and with the \(\mathcal {AB}\) algorithm in [42, 43] (see Section 3.2.4). The stepsize used in each algorithm is optimized. For FROST, we first manually find the optimal identical stepsize for all agents, which is 0.07 in our experiment, and then randomly generate uncoordinated stepsizes of FROST from the uniform distribution over the interval [0,0.07] (therefore, the convergence speed of FROST shown in this experiment is conservative). The numerical experiments thus verify our theoretical finding that as long as the largest stepsize of FROST is positive and sufficiently small, FROST converges linearly to the optimal solution.
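A toy version of this experiment can be run on a small problem. The sketch below is not the paper's experiment: it replaces logistic regression with scalar quadratics f_{i}(x)=(x−b_{i})^{2}/2, uses a hypothetical 4-agent directed graph, and assumes the FROST recursion takes the form x_{k+1}=Ax_{k}−Dz_{k}, Y_{k+1}=AY_{k}, z_{k+1}=Az_{k}+\(\widetilde {Y}_{k+1}^{-1}\)∇f(x_{k+1})−\(\widetilde {Y}_{k}^{-1}\)∇f(x_{k}), with Y_{0}=I and z_{0}=∇f(x_{0}).

```python
import numpy as np

def frost(A, alpha, grad, x0, iters):
    """Run the assumed FROST recursion for scalar decisions (p = 1)."""
    n = A.shape[0]
    x = np.asarray(x0, dtype=float).copy()
    Y = np.eye(n)                        # row i: agent i's estimate of pi_r
    g_scaled = grad(x) / np.diag(Y)      # gradients scaled by 1 / [y_i]_i
    z = g_scaled.copy()                  # z_0 = grad f(x_0) since Y_0 = I
    for _ in range(iters):
        x = A @ x - alpha * z            # consensus step plus local descent
        Y = A @ Y                        # eigenvector estimation
        g_new = grad(x) / np.diag(Y)
        z = A @ z + g_new - g_scaled     # gradient tracking
        g_scaled = g_new
    return x

# Hypothetical row-stochastic weights on a directed 4-agent graph.
A = np.array([[1/3, 0.0, 1/3, 1/3],
              [1/2, 1/2, 0.0, 0.0],
              [0.0, 1/2, 1/2, 0.0],
              [0.0, 0.0, 1/2, 1/2]])
b = np.array([1.0, -2.0, 4.0, 0.5])
grad = lambda x: x - b                   # gradient of f_i(x) = (x - b_i)^2 / 2
x_star = b.mean()                        # minimizer of the sum of the f_i

alpha_mixed = np.array([0.01, 0.0, 0.005, 0.0])   # two agents use zero stepsize
alpha_single = np.array([0.01, 0.0, 0.0, 0.0])    # only agent 0 descends

x_mixed = frost(A, alpha_mixed, grad, np.zeros(4), iters=5000)
x_single = frost(A, alpha_single, grad, np.zeros(4), iters=5000)
print(x_star, x_mixed, x_single)
```

The stepsizes here are hand-picked small values rather than values derived from Eq. (27); the second run mirrors the setting of Fig. 2 (left), where a single agent uses a positive stepsize.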
In the next experiment, we show the influence of the network sparsity on the convergence of FROST. For this purpose, we use three different graphs each with n=50 nodes, where \(\mathcal {G}_{1}\) has roughly 10% of total edges, \(\mathcal {G}_{2}\) has roughly 13% of total edges, and \(\mathcal {G}_{3}\) has roughly 16% of total edges. These graphs are shown in Fig. 3, and the performance of FROST over each one of them is shown in Fig. 4.
Conclusions
In this paper, we consider distributed optimization over both directed and undirected graphs with row-stochastic weights, where the agents in the network have uncoordinated stepsizes. Most of the existing algorithms are based on column-stochastic weights, which may be infeasible to implement in many practical scenarios. Row-stochastic weights, on the other hand, are straightforward to implement as each agent locally determines the weights assigned to the incoming information. We propose a fast algorithm that we call FROST (Fast Row-stochastic Optimization with uncoordinated STepsizes) and show that when the largest stepsize is positive and sufficiently small, FROST converges linearly to the optimal solution. Simulation results further verify the theoretical analysis.
Notes
 1.
EXTRA [21] is another related algorithm, which uses the difference between two consecutive DGD iterates to achieve linear convergence to the optimal solution.
 2.
See [32, 33] for related work with sublinear rate based on surplus consensus [55].
 3.
The constants δ_{1}, δ_{2}, and δ_{3} in Eqs. (27) and (28) are tunable parameters that only depend on the network topology and objective functions.
Abbreviations
 ADD-OPT: Accelerated Distributed Directed OPTimization
 DGD: Distributed Gradient Descent
 EXTRA: EXact firsT-ordeR Algorithm
 FROST: Fast Row-stochastic Optimization with uncoordinated STepsizes
References
 1
P. A. Forero, A. Cano, G. B. Giannakis, Consensus-based distributed support vector machines. J. Mach. Learn. Res. 11(May), 1663–1707 (2010).
 2
S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011). https://doi.org/10.1561/2200000016.
 3
H. Raja, W. U. Bajwa, Cloud K-SVD: A collaborative dictionary learning algorithm for big, distributed data. IEEE Trans. Signal Process. 64(1), 173–188 (2016).
 4
H. T. Wai, Z. Yang, Z. Wang, M. Hong, Multi-agent reinforcement learning via double averaging primal-dual optimization. arXiv preprint arXiv:1806.00877 (2018).
 5
P. Di Lorenzo, A. H. Sayed, Sparse distributed learning based on diffusion adaptation. IEEE Trans. Signal Process. 61(6), 1419–1433 (2013).
 6
S. Scardapane, R. Fierimonte, P. Di Lorenzo, M. Panella, A. Uncini, Distributed semi-supervised support vector machines. Neural Netw. 80, 43–52 (2016).
 7
A. Jadbabaie, J. Lin, A. Morse, Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Trans. Autom. Control. 48(6), 988–1001 (2003).
 8
G. Mateos, J. A. Bazerque, G. B. Giannakis, Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010). https://doi.org/10.1109/TSP.2010.2055862.
 9
J. A. Bazerque, G. B. Giannakis, Distributed spectrum sensing for cognitive radio networks by exploiting sparsity. IEEE Trans. Signal Process. 58(3), 1847–1862 (2010). https://doi.org/10.1109/TSP.2009.2038417.
 10
M. Rabbat, R. Nowak, in 3rd International Symposium on Information Processing in Sensor Networks. Distributed optimization in sensor networks (IEEE, 2004), pp. 20–27. https://doi.org/10.1109/IPSN.2004.1307319.
 11
S. Safavi, U. A. Khan, S. Kar, J. M. F. Moura, Distributed localization: a linear theory. Proc. IEEE. 106(7), 1204–1223 (2018). https://doi.org/10.1109/JPROC.2018.2823638.
 12
J. Tsitsiklis, D. P. Bertsekas, M. Athans, Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control. 31(9), 803–812 (1986).
 13
A. Nedić, A. Ozdaglar, Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control. 54(1), 48–61 (2009). https://doi.org/10.1109/TAC.2008.2009515.
 14
K. Yuan, Q. Ling, W. Yin, On the convergence of decentralized gradient descent. SIAM J. Optim. 26(3), 1835–1854 (2016).
 15
A. S. Berahas, R. Bollapragada, N. S. Keskar, E. Wei, Balancing communication and computation in distributed optimization. arXiv preprint arXiv:1709.02999 (2017).
 16
H. Terelius, U. Topcu, R. M. Murray, Decentralized multi-agent optimization via dual decomposition. IFAC Proc. Vol. 44(1), 11245–11251 (2011).
 17
J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, M. Puschel, D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. 61(10), 2718–2723 (2013). https://doi.org/10.1109/TSP.2013.2254478.
 18
E. Wei, A. Ozdaglar, in 51st IEEE Annual Conference on Decision and Control. Distributed alternating direction method of multipliers (IEEE, 2012), pp. 5445–5450. https://doi.org/10.1109/CDC.2012.6425904.
 19
W. Shi, Q. Ling, K. Yuan, G. Wu, W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Trans. Signal Process. 62(7), 1750–1761 (2014). https://doi.org/10.1109/TSP.2014.2304432.
 20
D. Jakovetic, J. Xavier, J. M. F. Moura, Fast distributed gradient methods. IEEE Trans. Autom. Control. 59(5), 1131–1146 (2014).
 21
W. Shi, Q. Ling, G. Wu, W. Yin, EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015). https://doi.org/10.1137/14096668X.
 22
K. Yuan, B. Ying, X. Zhao, A. H. Sayed, Exact diffusion for distributed optimization and learning, part I: algorithm development. IEEE Trans. Signal Process. (2018). https://doi.org/10.1109/TSP.2018.2875898.
 23
K. Yuan, B. Ying, X. Zhao, A. H. Sayed, Exact diffusion for distributed optimization and learning, part II: convergence analysis. IEEE Trans. Signal Process. (2018). https://doi.org/10.1109/TSP.2018.2875883.
 24
A. H. Sayed, in Academic Press Library in Signal Processing vol. 3. Diffusion adaptation over networks (Elsevier, Amsterdam, 2014), pp. 323–453.
 25
M. Zhu, S. Martínez, Discrete-time dynamic average consensus. Automatica. 46(2), 322–329 (2010).
 26
J. Xu, S. Zhu, Y. C. Soh, L. Xie, in IEEE 54th Annual Conference on Decision and Control. Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes (IEEE, 2015), pp. 2055–2060.
 27
P. Di Lorenzo, G. Scutari, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Distributed nonconvex optimization over time-varying networks (IEEE, 2016), pp. 4124–4128.
 28
G. Qu, N. Li, Harnessing smoothness to accelerate distributed optimization. IEEE Trans. Control Netw. Syst. 5(3), 1245–1260 (2018). https://doi.org/10.1109/TCNS.2017.2698261.
 29
P. Di Lorenzo, G. Scutari, NEXT: In-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Over Netw. 2(2), 120–136 (2016).
 30
K. I. Tsianos, S. Lawlor, M. G. Rabbat, in 51st IEEE Annual Conference on Decision and Control. Push-sum distributed dual averaging for convex optimization (IEEE, 2012), pp. 5453–5458. https://doi.org/10.1109/CDC.2012.6426375.
 31
A. Nedić, A. Olshevsky, Distributed optimization over time-varying directed graphs. IEEE Trans. Autom. Control. 60(3), 601–615 (2015).
 32
C. Xi, Q. Wu, U. A. Khan, On the distributed optimization over directed networks. Neurocomputing. 267, 508–515 (2017).
 33
C. Xi, U. A. Khan, Distributed subgradient projection algorithm over directed graphs. IEEE Trans. Autom. Control. 62(8), 3986–3992 (2016).
 34
M. Assran, M. Rabbat, Asynchronous subgradient-push. arXiv preprint arXiv:1803.08950 (2018).
 35
J. Zhang, K. You, AsySPA: An exact asynchronous algorithm for convex optimization over digraphs. arXiv preprint arXiv:1808.04118 (2018).
 36
A. Olshevsky, I. C. Paschalidis, A. Spiridonoff, Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions. arXiv preprint arXiv:1811.03982 (2018).
 37
C. Xi, U. A. Khan, DEXTRA: a fast algorithm for optimization over directed graphs. IEEE Trans. Autom. Control. 62(10), 4980–4993 (2017).
 38
D. Kempe, A. Dobra, J. Gehrke, in 44th Annual IEEE Symposium on Foundations of Computer Science. Gossip-based computation of aggregate information (IEEE, 2003), pp. 482–491. https://doi.org/10.1109/SFCS.2003.1238221.
 39
C. Xi, R. Xin, U. A. Khan, ADD-OPT: accelerated distributed directed optimization. IEEE Trans. Autom. Control. 63(5), 1329–1339 (2017).
 40
A. Nedić, A. Olshevsky, W. Shi, Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017).
 41
Y. Sun, G. Scutari, D. Palomar, in 2016 50th Asilomar Conference on Signals, Systems and Computers. Distributed nonconvex multi-agent optimization over time-varying networks (IEEE, Pacific Grove, 2016), pp. 788–794.
 42
R. Xin, U. A. Khan, A linear algorithm for optimization over directed graphs with geometric convergence. IEEE Control Syst. Lett. 2(3), 325–330 (2018).
 43
R. Xin, U. A. Khan, Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. arXiv preprint arXiv:1808.02942 (2018).
 44
V. S. Mai, E. H. Abed, in 2016 American Control Conference (ACC). Distributed optimization over weighted directed graphs using row stochastic matrix (IEEE, 2016), pp. 7165–7170. https://doi.org/10.1109/ACC.2016.7526803.
 45
C. Xi, V. S. Mai, R. Xin, E. Abed, U. A. Khan, Linear convergence in optimization over directed graphs with row-stochastic matrices. IEEE Trans. Autom. Control. (2018). in press.
 46
J. Xu, S. Zhu, Y. Soh, L. Xie, Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Trans. Autom. Control. 63(2), 434–448 (2018).
 47
A. Nedić, A. Olshevsky, W. Shi, C. A. Uribe, in 2017 American Control Conference (ACC). Geometrically convergent distributed optimization with uncoordinated stepsizes (IEEE, 2017), pp. 3950–3955. https://doi.org/10.23919/ACC.2017.7963560.
 48
Q. Lü, H. Li, D. Xia, Geometrical convergence rate for distributed optimization with time-varying directed graphs and uncoordinated stepsizes. Inf. Sci. 422, 516–530 (2018).
 49
R. A. Horn, C. R. Johnson, Matrix Analysis, 2nd ed (Cambridge University Press, New York, NY, 2013).
 50
H. W. Kuhn, The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2(1–2), 83–97 (1955).
 51
S. Safavi, U. A. Khan, in 48th IEEE Asilomar Conference on Signals, Systems, and Computers. On the convergence rate of swap-collide algorithm for simple task assignment (IEEE, 2014), pp. 1507–1510.
 52
M. Zhu, S. Martínez, Discrete-time dynamic average consensus. Automatica. 46(2), 322–329 (2010).
 53
F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, M. Vetterli, in IEEE International Symposium on Information Theory. Weighted gossip: distributed averaging using non-doubly stochastic matrices (IEEE, 2010), pp. 1753–1757. https://doi.org/10.1109/ISIT.2010.5513273.
 54
F. Saadatniaki, R. Xin, U. A. Khan, Optimization over time-varying directed graphs with row and column-stochastic matrices. arXiv preprint arXiv:1810.07393 (2018).
 55
K. Cai, H. Ishii, Average consensus on general strongly connected digraphs. Automatica. 48(11), 2750–2761 (2012). https://doi.org/10.1016/j.automatica.2012.08.003.
 56
T. Yang, J. George, J. Qin, X. Yi, J. Wu, Distributed finite-time least squares solver for network linear equations. arXiv preprint arXiv:1810.00156 (2018).
 57
R. A. Horn, C. R. Johnson, Matrix Analysis (Cambridge University Press, New York, NY, 2013).
 58
D. P. Bertsekas, Nonlinear Programming (Athena Scientific, Belmont, 1999).
 59
Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course vol. 87 (Springer, New York, 2013).
Acknowledgments
Not applicable.
Funding
This work has been partially supported by an NSF CAREER Award # CCF-1350264. None of the authors have any competing interests in the manuscript.
Availability of data and materials
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Author information
Contributions
All authors contributed equally to this paper. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Usman A. Khan.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Authors’ information
Ran Xin received his B.S. degree in Mathematics and Applied Mathematics from Xiamen University, China, in 2016, and his M.S. degree in Electrical Engineering from Tufts University, USA, in 2018. Currently, he is a Ph.D. student in the Electrical and Computer Engineering department at Tufts University. His research interests include optimization, control, and statistical learning theory.
Chenguang Xi received his B.S. degree in Microelectronics from Shanghai Jiao Tong University, China, in 2010, and his M.S. and Ph.D. degrees in Electrical and Computer Engineering from Tufts University in 2012 and 2016, respectively. Currently, he is a research scientist in Applied Machine Learning at Facebook Inc. His research interests include machine learning and artificial intelligence.
Dr. Usman Khan has been an Associate Professor of Electrical and Computer Engineering (ECE) at Tufts University, Medford, MA, USA, since September 2017, where he is the Director of the Signal Processing and Robotic Networks laboratory. His research interests include statistical signal processing, network science, and distributed optimization over autonomous multi-agent systems. He has published extensively on these topics, with more than 75 articles in journals and conference proceedings, and holds multiple patents. Recognition of his work includes the prestigious National Science Foundation (NSF) Career award, several NSF REU awards, an IEEE journal cover, three best student paper awards in IEEE conferences, and several news articles including one in IEEE Spectrum. Dr. Khan joined Tufts as an Assistant Professor in 2011 and held a Visiting Professor position at KTH, Sweden, in Spring 2015. Prior to joining Tufts, he was a postdoc in the GRASP lab at the University of Pennsylvania. He received his B.S. degree in 2002 from University of Engineering and Technology, Pakistan, M.S. degree in 2004 from University of Wisconsin-Madison, USA, and Ph.D. degree in 2009 from Carnegie Mellon University, USA, all in ECE. Dr. Khan is an IEEE senior member and has been an associate member of the Sensor Array and Multichannel Technical Committee with the IEEE Signal Processing Society since 2010. He is an elected member of the IEEE Big Data special interest group and has served on the IEEE Young Professionals Committee and on the IEEE Technical Activities Board. He was an editor of the IEEE Transactions on Smart Grid from 2014 to 2017, and is currently an associate editor of the IEEE Control Systems Letters. He has served on the Technical Program Committees of several IEEE conferences and has organized and chaired several IEEE workshops and sessions.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Distributed optimization
 Multiagent systems
 Directed graphs
 Linear convergence