Communication efficient distributed weighted nonlinear least squares estimation
EURASIP Journal on Advances in Signal Processing, volume 2018, Article number: 66 (2018)
Abstract
The paper addresses design and analysis of communication-efficient distributed algorithms for solving weighted nonlinear least squares problems in multi-agent networks. Communication efficiency is highly relevant in modern applications like cyber-physical systems and the Internet of things, where a significant portion of the involved devices have energy constraints in terms of limited battery power. Furthermore, nonlinear models arise frequently in such systems, e.g., with power grid state estimation. In this paper, we develop and analyze a nonlinear communication-efficient distributed algorithm dubbed \(\mathcal {CREDONL}\) (nonlinear \(\mathcal {CREDO}\)). \(\mathcal {CREDONL}\) generalizes the recently proposed linear method \(\mathcal {CREDO}\) (Communication efficient REcursive Distributed estimatOr) to nonlinear models. We establish \(\mathcal {CREDONL}\)’s strong consistency for a broad class of nonlinear least squares problems and generic underlying multi-agent network topologies. Furthermore, we demonstrate the communication efficiency of the method, both theoretically and by simulation examples. For the former, we rigorously prove that \(\mathcal {CREDONL}\) achieves significantly faster mean squared error rates in terms of the elapsed communication cost than existing alternatives. For the latter, the considered simulation experiments show communication savings by at least an order of magnitude.
Introduction
We consider distributed nonlinear least squares estimation in networked systems. The networked system considered consists of heterogeneous networked entities or agents, where the inter-agent collaboration conforms to a pre-assigned, possibly sparse, communication graph. The agents acquire their local, noisy, nonlinear observations about the unknown phenomenon (an unknown static vector parameter θ) in a streaming fashion over discrete time instances t. The goal for each agent is to continuously generate an estimate of θ over time instances t in a recursive fashion, where the estimate update of an agent involves simultaneous assimilation of the newly acquired local observations and of the information received through messages from agents in its immediate neighborhood. The assumed setup is highly relevant to several emerging applications in the context of cyber-physical systems (CPS) and the Internet of things (IoT), like state estimation in smart grids, predictive maintenance, and production monitoring in industrial manufacturing systems. For example, with continuous state estimation of a smart grid, the acquired measurements (voltages, angles) are in general nonlinear functions of the unknown state; further, the measurements are inherently distributed across different physical locations (elements of the system), and they arrive continuously over time with a prescribed sampling rate. Furthermore, the scale (network size) of the distributed system (e.g., a large-scale microgrid) and near real-time requirements on the estimation results make distributed, fusion center-free processing a desirable choice.
An important aspect of distributed estimation algorithms in the context of the applications described above is communication efficiency, i.e., achieving good estimation performance with minimal communication cost. Real-world applications such as large-scale deployments of CPS or IoT typically involve entities or agents with limited on-board energy resources. In addition to the limited on-board power, the energy requirement per unit of communication is usually significantly higher than the energy requirement per unit of computation [48]. Hence, communication efficiency is a highly desirable trait in such systems. Moreover, for large-scale systems which require continuous monitoring, it is crucial to reduce the communication cost as much as possible without compromising the performance of the inference task at hand, which then ensures a longer lifetime of such systems.
In this paper, we propose and analyze a communication-efficient distributed estimator for nonlinear observation models that we refer to as \(\mathcal {CREDONL}\). The estimator \(\mathcal {CREDONL}\) generalizes the recently proposed linear distributed estimator \(\mathcal {CREDO}\), see [37, 38], which is designed for, and works with, linear measurement (observation) models only. Specific contributions of the paper are as follows.
We propose the nonlinear distributed estimator \(\mathcal {CREDONL}\) that works for a broad class of nonlinear observation models, where the model information, in terms of node i’s sensing function and noise statistics, is available only at the individual agent i itself. With the proposed algorithm, each agent communicates probabilistically sparsely over time. More precisely, the probability that determines whether a node communicates at time t decays sublinearly to zero with t, which in turn makes the communication cost scale sublinearly with t.
Despite the dropped communications and the presence of nonlinearities in the sensing model, we show that the proposed algorithm achieves the optimal O(1/t) rate of mean square error (MSE) decay^{Footnote 1}. The achievability of the optimal MSE decay in terms of time t translates into significant improvements in the rate at which the MSE scales with respect to the per-agent average communication cost \(\mathcal {C}_{t}\) up to time t, namely from \(O(1/\mathcal {C}_{t})\) with existing methods, e.g., [15, 16, 31, 34–36, 40], to \(O\left (1/\mathcal {C}_{t}^{2-\zeta }\right)\) with the proposed method, where ζ>0 is arbitrarily small. We also establish strong consistency of the estimate sequence at each agent, showing that each agent’s local estimator converges almost surely to the true parameter θ. Simulation examples confirm significant communication savings of \(\mathcal {CREDONL}\) over existing alternatives, by at least an order of magnitude.
We now briefly review the literature on distributed inference and motivate our algorithm \(\mathcal {CREDONL}\). Distributed inference algorithms can be broadly divided into two classes based on the presence of a fusion center. The first class assumes the presence of a fusion center, e.g., [11, 23, 26, 27, 47]. The fusion center assigns subtasks to the individual agents and subsequently fuses the information from the different agents. However, when the data samples are geographically distributed across the individual agents and are streamed in time, fusion center-based solutions are impractical.
The second class of distributed inference methods is fusion center-free. These works typically assume that the agents are interconnected over a generic network and that each agent acquires its local measurements in a streaming fashion. These estimators are iterative (recursive): at each iteration (time instance), each agent assimilates its new measurement and exchanges messages with its immediate neighbors, see, e.g., [2, 4–6, 14, 20, 22, 24, 25, 28–31, 34–36, 39, 43, 46]. Most related to our work are references that consider distributed estimation under nonlinear observation models, as we do here, or distributed convex stochastic optimization, e.g., [15, 16, 31, 34–36, 40]. However, among these works, the best achieved MSE communication rate is \(O(1/\mathcal {C}_{t})\). In contrast, we establish here a strictly faster MSE communication rate equal to \(O\left (1/\mathcal {C}_{t}^{2-\zeta }\right)\) (ζ>0 is arbitrarily small). Finally, it is worth noting that there exist a few distributed algorithms (without a fusion node) that are also designed to achieve communication efficiency, e.g., [13, 21, 44–46]. In [46], a data censoring method is employed to save on computation and communication costs. However, the communication savings in [46] are a constant proportion with respect to a vanilla method which uses all allowable communications at all times. In [21], the communication savings come at the cost of extra computations. References [13, 44, 45] also consider a different setup than we do here, namely distributed optimization (with no fusion center) where the data is available a priori (i.e., it is not streamed). In terms of the strategy to save communications, references [13], [21], and [44, 45] consider, respectively, deterministically increasingly sparse communication, an adaptive communication scheme, and selective activation of agents. These strategies are different from ours, which utilizes a randomized, increasing “sparsification” of communications.
Consensus+innovations methods (see, e.g., [16, 17, 19, 20]) are a subclass of distributed recursive algorithms (the second class of algorithms mentioned above) that process data in a streaming fashion. With consensus+innovations methods, each node updates its estimate at each iteration in a twofold manner: by weight-averaging its solution estimate with the neighbors’ solution estimates (consensus) and by assimilating its newly acquired data sample (innovation). Therein, the consensus and innovation weights are usually time-varying and are carefully designed towards achieving optimal asymptotic performance, measured, e.g., through the asymptotic covariance of the estimate sequence. Within the class of consensus+innovations distributed estimation algorithms (see, e.g., [18, 20]), the design of communication-efficient methods has been addressed in [37], see also [38], for linear observation models, wherein a mixed time-scale stochastic approximation method dubbed \(\mathcal {CREDO}\) has been proposed. We extend here \(\mathcal {CREDO}\) to nonlinear observation models. Technically speaking, establishing convergence and asymptotic convergence rates for \(\mathcal {CREDONL}\) requires establishing guarantees for the existence of stochastic Lyapunov functions for the estimate sequence. The update of the estimate sequence in \(\mathcal {CREDONL}\) involves a gain matrix which is in turn a function of the estimate itself. Moreover, in addition to the gain matrix being a function of the estimate, the sensing functions exhibit localized behavior in terms of smoothness and global observability. Hence, the setup considered in this paper requires technical tools different from those for \(\mathcal {CREDO}\), which we develop in this paper.
The rest of the paper is organized as follows. Section 2 describes the problem that we consider and gives the needed preliminaries on conventional (centralized) and distributed recursive estimation. Section 3 presents the novel \(\mathcal {CREDONL}\) algorithm that we propose, while Section 4 states our main results on the algorithm’s performance. Section 5 presents the simulation experiments, and finally, we conclude in Section 7. Proofs of the main results are relegated to Appendix A.
Model and preliminaries
Sensing and network models
Let θ∈Θ, where \(\Theta \subset \mathbb {R}^{M}\) (its properties are to be specified shortly), be an M-dimensional parameter that is to be estimated by a network of N agents. Every agent n at time index t makes a noisy observation y_{n}(t), a noisy function of θ. Formally, the observation model for the n-th agent is given by
$$ \mathbf{y}_{n}(t) = \mathbf{f}_{n}({\boldsymbol{\theta}}) + \boldsymbol{\gamma}_{n}(t), $$
where \(\mathbf {f}_{n}:\mathbb {R}^{M}\mapsto \mathbb {R}^{M_{n}}\) is a nonlinear sensing function, where M_{n}≪M, \(\{\mathbf {y}_{n}(t)\} \in \mathbb {R}^{M_{n}}\) is the observation sequence for the n-th agent, and {γ_{n}(t)} is a zero-mean, temporally independent and identically distributed (i.i.d.) noise sequence at the n-th agent with nonsingular covariance R_{n}, where \(\mathbf {R}_{n}\in \mathbb {R}^{M_{n}\times M_{n}}\). The noise processes are independent across different agents. We state an assumption on the noise processes before proceeding further. Throughout, we denote by ∥·∥ the \(\mathcal {L}_{2}\)-norm of its vector or matrix argument and by \(\mathbb {E} [\cdot]\) the expectation operator.
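A minimal simulation sketch may make the observation model concrete; the sensing function f_n, dimensions, and noise covariance R_n below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(6)
theta = np.array([0.4, -0.2])                        # unknown parameter, M = 2
f_n = lambda th: np.array([np.sin(th[0] + th[1])])   # illustrative f_n, M_n = 1 < M
R_n = np.array([[0.01]])                             # noise covariance
gamma = rng.multivariate_normal(np.zeros(1), R_n, size=10000)
y = f_n(theta) + gamma                               # 10000 streamed observations
# The noise is zero-mean, so the empirical mean of y approaches f_n(theta).
assert np.all(np.abs(y.mean(axis=0) - f_n(theta)) < 0.005)
```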
Assumption 1
There exists ε_{1}>0, such that, for all n, \(\mathbb {E} \left [\left \|\boldsymbol{\gamma} _{n}(t)\right \|^{2+\epsilon _{1}}\right ]<\infty \).
We remark that the main results of the paper (Theorems 4.1 and 4.2) continue to hold even if ε_{1}=0^{Footnote 2}. The above assumption encompasses a general class of noise distributions in the setup.
The heterogeneity of the setup is exhibited in terms of the agent-dependent sensing functions and noise covariances at the agents. Each agent is interested in reconstructing the true underlying parameter θ. We assume an agent is aware only of its local observation model, i.e., the nonlinear sensing function f_{n}(·) and the associated noise covariance R_{n}; hence, it has no information about the sensing functions and noise processes of other agents.
The agents are interconnected through a communication network that we assume throughout the paper is modeled as an undirected, simple, connected graph G=(V,E), with V={1,…,N} and E denoting the set of agents (nodes) and the set of communication links, respectively, see [3]. (With the proposed \(\mathcal {CREDONL}\) method, the available links in E are activated selectively across algorithm iterations in a probabilistic fashion, as detailed in Section 3.) The neighborhood of node n in graph G is
$$ \Omega_{n} = \left\{ l \in V \,:\, (n,l) \in E \right\}. $$
The node n has degree d_{n}=|Ω_{n}|. The structure of the graph is described by the N×N adjacency matrix, A=A^{⊤}=[A_{nl}], with A_{nl}=1 if (n,l)∈E, and A_{nl}=0 otherwise. Let D=diag(d_{1},…,d_{N}). The graph Laplacian L=D−A is positive semi-definite, with eigenvalues ordered as 0=λ_{1}(L)≤λ_{2}(L)≤⋯≤λ_{N}(L). The eigenvector of L corresponding to λ_{1}(L) is \((1/\sqrt {N})\mathbf {1}_{N}\). (Here, 1_{N} is the N-dimensional vector with all entries equal to one.) The multiplicity of the zero eigenvalue of L equals the number of connected components of the network; for a connected graph, λ_{2}(L)>0. This second smallest eigenvalue is known as the algebraic connectivity, or the Fiedler value, of the network (see, e.g., [7]).
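As a concrete illustration of these spectral properties, the following minimal Python sketch (not part of the paper; the 4-node ring graph is an illustrative choice) builds L = D − A and checks that λ_1(L) = 0, that λ_2(L) > 0 for a connected graph, and that 1_N lies in the null space of L:

```python
import numpy as np

def laplacian(n_nodes, edges):
    """Return L = D - A for an undirected simple graph."""
    A = np.zeros((n_nodes, n_nodes))
    for (i, j) in edges:
        A[i, j] = A[j, i] = 1.0
    D = np.diag(A.sum(axis=1))
    return D - A

# 4-node ring: connected, so the Fiedler value lambda_2(L) must be positive.
L = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
eigvals = np.sort(np.linalg.eigvalsh(L))
assert abs(eigvals[0]) < 1e-10        # lambda_1(L) = 0 always holds
assert eigvals[1] > 0                 # algebraic connectivity > 0 (connected)
assert np.allclose(L @ np.ones(4), 0.0)  # all-ones vector is in the null space
```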
Example: distributed static phase estimation in smart grids
Many applications within cyber-physical systems and the Internet of things can be modeled as nonlinear distributed estimation problems of type (1). Such a class of models arises, e.g., with state estimation in power systems; therein, a phasorial representation of voltages and currents is usually utilized, wherein nonlinearity in general emerges from power-flow equations [1, 33]. Here, we focus on a specific problem within this class, namely distributed static phase estimation in smart grids. We describe the model briefly and refer to, e.g., [12, 19] for more details. Here, graph G corresponds to a power grid network of n=1,...,N generators and loads (a single generator or a single load is a node in the graph), while the edge set E corresponds to the set of transmission lines or interconnections. (For simplicity, even though not necessary, we assume that the physical interconnection network matches the inter-node communication network.) Assume that G is connected. The state of a node n is described by \((\mathcal {V}_{n},{\phi _{n}})\), where \(\mathcal {V}_{n}\) is the voltage magnitude and ϕ_{n} is the phase angle. As commonly assumed, e.g., [12], we let the voltages \(\mathcal {V}_{n}\) be known constants; on the other hand, the angles ϕ_{n} are unknown and are to be estimated. Following a standard approximation path, the real power flow across the transmission line between nodes n and l can be expressed as, e.g., [12]:
$$ P_{nl}(\boldsymbol{\phi}) = \mathcal{V}_{n}\,\mathcal{V}_{l}\,b_{nl}\,\sin(\phi_{nl}), $$
where ϕ is the vector that collects the unknown phase angles ϕ_{n} across all nodes, b_{nl} is the admittance of line (n,l), and ϕ_{nl}=ϕ_{n}−ϕ_{l}. Denote by E_{m}⊂E the set of lines equipped with power flow measuring devices. The power flow measurement at line (n,l)∈E_{m} is then given by:
$$ y_{nl}(t) = \mathcal{V}_{n}\,\mathcal{V}_{l}\,b_{nl}\,\sin(\phi_{nl}) + \gamma_{nl}(t), $$
where {γ_{nl}(t)} is the zero-mean i.i.d. measurement noise with finite moment \(\mathbb {E}[\gamma _{nl}(t)^{2+\epsilon _{1}}]\), for some ε_{1}>0. Assume that each measurement y_{nl}(t) is assigned to one of its incident nodes n or l. Further, let \(\Omega _{n}^{\prime }\) denote the set of all indexes l such that measurements y_{nl}(t) are available at node n. Then, it becomes clear that the angle estimation problem is a special case of model (1), with the measurement vectors \(\mathbf {y}_{n}(t)=[y_{nl}(t),~l\in \Omega _{n}^{\prime }]^{\top }, n=1,...,N\), noise vectors \(\mathbf {\gamma }_{n}(t)=[\gamma _{nl}(t),~l\in \Omega _{n}^{\prime }]^{\top }\), n=1,...,N, and sensing functions \(\mathbf {f}_{n}(\boldsymbol {\phi })=[\mathcal {V}_{n}\,\mathcal {V}_{l}\,b_{nl}\, \sin (\phi _{nl}),~l\in \Omega _{n}^{\prime }]^{\top },~n=1,...,N\). It can be shown that under reasonable assumptions on the phase angle ranges (which correspond to the admissible parameter set Θ) and on the smart grid network and admittance structure, the assumptions we make on the sensing model are satisfied,^{Footnote 3} and hence, \(\mathcal {CREDONL}\) can be effectively applied; we refer to [12, 19] for details.
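A toy numerical rendering of this measurement model may help; the voltages, admittance, and angles below are made-up values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([1.00, 1.02, 0.98])     # known voltage magnitudes V_n
phi = np.array([0.00, 0.10, -0.05])  # unknown phase angles (ground truth)
b = 2.0                              # illustrative line admittance b_nl

def power_flow(n, l):
    """Noiseless real power flow across line (n, l)."""
    return V[n] * V[l] * b * np.sin(phi[n] - phi[l])

def measurement(n, l, sigma=0.01):
    """Noisy power flow measurement y_nl(t)."""
    return power_flow(n, l) + sigma * rng.standard_normal()

y01 = measurement(0, 1)              # one streamed sample for line (0, 1)
# The noiseless flow is antisymmetric: P_nl = -P_ln.
assert np.isclose(power_flow(0, 1), -power_flow(1, 0))
```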
Preliminaries: centralized batch and recursive weighted nonlinear least squares estimation
In this subsection, we go over the preliminaries of centralized batch and recursive weighted nonlinear least squares estimation.
Consider a networked setup with a hypothetical fusion center which has access to the samples collected at all nodes at all times. In such a setting, for the sensing model described in (1), one of the classical algorithms that finds extensive use is weighted nonlinear least squares (WNLS) (see, for example, [15]). The applicability of WNLS to fairly generic setups, characterized by the absence of full noise statistics, makes it particularly appealing in practice. We discuss properties of the WNLS estimator before proceeding further. Define the cost function \(\mathcal {Q}_{t}\) as follows:
$$ \mathcal{Q}_{t}({\boldsymbol{\theta}}) = \sum_{s=0}^{t-1}\sum_{n=1}^{N}\left(\mathbf{y}_{n}(s)-\mathbf{f}_{n}({\boldsymbol{\theta}})\right)^{\top}\mathbf{R}_{n}^{-1}\left(\mathbf{y}_{n}(s)-\mathbf{f}_{n}({\boldsymbol{\theta}})\right). $$
The hypothetical fusion center in such a setting generates the estimate sequence \(\left \{{\widehat {\boldsymbol {\theta }}}_{t}\right \}\) in the following way:
$$ {\widehat{\boldsymbol{\theta}}}_{t} \in \arg\min_{{\boldsymbol{\theta}}\in\Theta}\,\mathcal{Q}_{t}({\boldsymbol{\theta}}). $$
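The batch WNLS recipe above can be sketched numerically; the two scalar sensing functions, noise variances, and the crude grid-search minimizer below are illustrative stand-ins (a real implementation would use a proper numerical optimizer):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.5
f = [lambda th: th, lambda th: th + 0.5 * np.sin(th)]  # two agents' sensing functions
R = [0.04, 0.09]                                       # noise variances R_n
T = 2000
y = [fn(theta_true) + np.sqrt(Rn) * rng.standard_normal(T)
     for fn, Rn in zip(f, R)]                          # all streamed samples

def Q(theta):
    """Weighted nonlinear least squares cost Q_t(theta)."""
    return sum(np.sum((yn - fn(theta)) ** 2) / Rn
               for yn, fn, Rn in zip(y, f, R))

grid = np.linspace(-1.0, 2.0, 3001)   # Theta = [-1, 2], crude grid search
theta_hat = grid[np.argmin([Q(th) for th in grid])]
assert abs(theta_hat - theta_true) < 0.05
```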
The consistency and the asymptotic behavior of the estimate sequence \(\{{\widehat {\boldsymbol {\theta }}}_{t}\}\) have been analyzed in the literature under the following weak assumptions:
Assumption 2
The set Θ is a compact convex subset of \(\mathbb {R}^{M}\) with nonempty interior int(Θ), and the true (but unknown) parameter θ∈int(Θ).
Assumption 3
The sensing model is globally observable, i.e., any pair \({\boldsymbol {\theta }}, \acute {{\boldsymbol {\theta }}}\) of possible parameter instances in Θ satisfies
$$ \sum_{n=1}^{N}\left\|\mathbf{f}_{n}({\boldsymbol{\theta}})-\mathbf{f}_{n}(\acute{{\boldsymbol{\theta}}})\right\|^{2} = 0 $$
if and only if \({\boldsymbol {\theta }}=\acute {{\boldsymbol {\theta }}}\).
Assumption 4
The sensing function f_{n}(·) for each n is continuously differentiable in the interior int(Θ) of the set Θ. For each θ in the set Θ, the (normalized) gain matrix Γ_{θ} defined by
$$ \Gamma_{{\boldsymbol{\theta}}} = \frac{1}{N}\sum_{n=1}^{N}\nabla\mathbf{f}_{n}({\boldsymbol{\theta}})\,\mathbf{R}_{n}^{-1}\,\nabla\mathbf{f}_{n}({\boldsymbol{\theta}})^{\top} $$
is invertible, where \(\nabla \mathbf {f}_{n}(\cdot) \in \mathbb {R}^{M \times M_{n}}\) denotes the gradient of f_{n}(·).
Smoothness conditions on the sensing functions, such as the one imposed by Assumption 4, are common in statistical estimation with nonlinear observation models. Note that the matrix Γ_{θ} is well defined at the true value of the parameter θ, since θ∈int(Θ) and the continuous differentiability of the sensing functions holds for all θ∈int(Θ).
The asymptotic properties of the WNLS estimator in terms of consistency and asymptotic normality are characterized by the following classical result:
Proposition 1
([15]) Let the parameter set Θ be compact and the sensing function f_{n}(·) be continuous on Θ for each n. Let \(\mathcal {G}_{t}\) be an increasing sequence of σ-algebras such that \(\mathcal {G}_{t} = \sigma \left (\left \{\left \{\mathbf {y}_{n}(s)\right \}_{s=0}^{t-1}\right \}_{n=1}^{N}\right)\). Further, denote by θ the true parameter to be estimated. Then, a WNLS estimator of θ exists, i.e., there exists an \(\{\mathcal {G}_{t}\}\)-adapted process \(\left \{{\widehat {\boldsymbol {\theta }}}_{t}\right \}\) such that
$$ {\widehat{\boldsymbol{\theta}}}_{t} \in \arg\min_{{\boldsymbol{\theta}}\in\Theta}\,\mathcal{Q}_{t}({\boldsymbol{\theta}}), \quad \text{for all } t. $$
Moreover, if the model is globally observable, i.e., Assumption 3 holds, the WNLS estimate sequence \(\left \{{\widehat {\boldsymbol {\theta }}}_{t}\right \}\) is consistent, i.e.,
$$ \mathbb{P}_{{\boldsymbol{\theta}}}\left(\lim_{t\to\infty}{\widehat{\boldsymbol{\theta}}}_{t}={\boldsymbol{\theta}}\right)=1, $$
where \(\mathbb {P}_{{\boldsymbol {\theta }}}(\cdot)\) denotes the probability operator. Additionally, if Assumption 4 holds, the parameter estimate sequence is asymptotically normal, i.e.,
$$ \sqrt{t}\left({\widehat{\boldsymbol{\theta}}}_{t}-{\boldsymbol{\theta}}\right) \overset{\mathcal{D}}{\Longrightarrow} \mathcal{N}\left(\mathbf{0},\Sigma_{c}\right), $$
where
$$ \Sigma_{c} = \frac{1}{N}\,\Gamma_{{\boldsymbol{\theta}}}^{-1}, $$
Γ_{θ} is as given by (8) and \(\overset {\mathcal {D}}{\Longrightarrow }\) refers to convergence in distribution (weak convergence).
The centralized WNLS estimator above suffers from significant communication overhead due to the inherent access to data samples across all agents at all times. Moreover, the minimization in (6) requires batch processing due to the non-sequential nature of the minimization. Recursive centralized estimators utilizing stochastic approximation-type approaches have been proposed in [9, 10, 32, 41, 42], which mitigate the batch processing through the development of sequential, albeit centralized, estimators. However, such recursive estimators still suffer from enormous communication overhead, as the fusion center requires access to the data samples across all agents at all times, as well as the global model information in terms of the sensing functions and the noise statistics across agents.
Preliminaries: distributed WNLS
Sequential distributed recursive schemes conforming to consensus+innovations-type updates (see, for example, [19] and Eq. (16) ahead), where each agent’s knowledge of the model is limited to its own local model, have been proposed in [16, 40]. In [16], so as to achieve the optimal asymptotic covariance, the global model information is made available through a carefully constructed gain matrix update, which adds computation complexity and communication cost. In contrast with [16], reference [40] introduces a trade-off: it settles for a suboptimal asymptotic covariance while using only local model information at individual agents for evaluating the gain matrix, thus saving communication cost. However, for both of the aforementioned algorithms [16, 40], the number of communications scales linearly with the number of per-node sampled observations {y_{n}(t)}. This paper builds upon the ideas of sequential distributed recursive schemes catering to nonlinear observation models, as proposed in [16, 40], to construct a communication-efficient scheme without compromising the mean square error performance. That is, we aim to achieve the order-optimal MSE decay rate of Θ(1/t) (see, e.g., [9]) in terms of the number of per-node processed samples, while reducing the Θ(t) communication cost that characterizes previous approaches.
Before proceeding further, we briefly summarize the estimator in [40], which is referred to as the benchmark estimator henceforth. The overall update rule at an agent n corresponds to
$$ \widetilde{\mathbf{x}}_{n}(t) = \mathbf{x}_{n}(t) - \beta_{t}\sum_{l\in\Omega_{n}}\left(\mathbf{x}_{n}(t)-\mathbf{x}_{l}(t)\right) + \alpha_{t}\,\nabla\mathbf{f}_{n}\left(\mathbf{x}_{n}(t)\right)\mathbf{R}_{n}^{-1}\left(\mathbf{y}_{n}(t)-\mathbf{f}_{n}\left(\mathbf{x}_{n}(t)\right)\right) $$
and
$$ \mathbf{x}_{n}(t+1) = \mathcal{P}_{\Theta}\left[\widetilde{\mathbf{x}}_{n}(t)\right], $$
where Ω_{n} is the communication neighborhood of agent n (determined by the Laplacian L); ∇f_{n}(·) is the gradient of f_{n}; \(\mathcal {P}_{\Theta }[\cdot ]\) is the projection operator corresponding to projecting on Θ; and {β_{t}} and {α_{t}} are consensus and innovation weight sequences given by
$$ \beta_{t} = \frac{\widehat{\beta}_{0}}{(t+1)^{\delta_{1}}}, \qquad \alpha_{t} = \frac{\widehat{\alpha}_{0}}{t+1}, $$
where \(\widehat {\alpha }_{0}, \widehat {\beta }_{0} > 0\), \(0<\delta _{1}<1/2-1/(2+\epsilon _{1})\), and ε_{1} was defined in Assumption 1. From the asymptotic normality result (Theorem 2 in [40]), it can be inferred that the MSE decays as O(1/t).
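The benchmark consensus+innovations recursion can be illustrated with a toy scalar simulation; the graph, sensing function, noise level, and weight constants below are made-up choices, and clipping to an interval plays the role of the projection onto Θ:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, N, T = 1.0, 4, 5000
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4-node ring
f = lambda th: th + 0.3 * np.sin(th)      # illustrative sensing function
df = lambda th: 1.0 + 0.3 * np.cos(th)    # its derivative
x = np.zeros(N)                           # initial estimates
for t in range(T):
    beta = 0.3 / (t + 1) ** 0.3           # consensus weight beta_t
    alpha = 1.0 / (t + 1)                 # innovation weight alpha_t
    y = f(theta_true) + 0.2 * rng.standard_normal(N)  # new samples
    x_new = x.copy()
    for n in range(N):
        cons = sum(x[n] - x[l] for l in neighbors[n])
        innov = df(x[n]) * (y[n] - f(x[n]))
        # Clipping to [-5, 5] stands in for the projection onto Theta.
        x_new[n] = np.clip(x[n] - beta * cons + alpha * innov, -5.0, 5.0)
    x = x_new
# All agents' estimates end up close to the true parameter.
assert np.all(np.abs(x - theta_true) < 0.1)
```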
Communication efficiency
The communication cost \(\mathcal {C}_{t}\) is defined as the expected number of per-node communications up to iteration t. Formally, the communication cost \(\mathcal {C}_{t}\) is given by
$$ \mathcal{C}_{t} = \mathbb{E}\left[\sum_{s=0}^{t-1}\mathbb{I}_{\{\text{agent } n \text{ transmits at time } s\}}\right], $$
where agent n is arbitrary (the expectation in (16) does not depend on n) and \(\mathbb {I}_{A}\) represents the indicator of event A. The communication cost \(\mathcal {C}_{t}\) for both the centralized WNLS estimator (where all agents transmit their samples y_{n}(t) to the fusion center at all times t) and the distributed estimators in [16, 40] is \(\mathcal {C}_{t} = \Theta (t)\), where we note that the iteration count t is equivalent to the number of per-node samples collected up to time t. Consequently, for these methods, the MSE decays as \(O\left (\frac {1}{\mathcal {C}_{t}}\right)\).
\(\mathcal {CREDONL}\): a communication efficient distributed WNLS estimator
In this section, we present the \(\mathcal {CREDONL}\) estimator. \(\mathcal {CREDONL}\) is based on a carefully chosen protocol which makes the communications increasingly probabilistically sparse. Intuitively speaking, the communication protocol exploits the idea that, with gradual information accumulation at the agents through communications, an agent accumulates sufficient information about the parameter of interest, which then allows it to drop communications increasingly often. Technically speaking, for each node n, at every time t, we introduce a binary random variable ψ_{n,t}, where
$$ \psi_{n,t} = \left\{\begin{array}{ll} 1, & \text{with probability}\;\; \zeta_{t}\\ 0, & \text{with probability}\;\; 1-\zeta_{t}, \end{array}\right. $$
where the ψ_{n,t}’s are independent both across time and across nodes, i.e., across t and n, respectively, and are also independent of the nodes’ observations in (1). The random variable ψ_{n,t} abstracts out the decision of node n at time t whether to participate in the neighborhood information exchange or not. We specifically take ρ_{t} and ζ_{t} of the form
$$ \rho_{t} = \frac{\rho_{0}}{(t+1)^{\epsilon/2}}, \qquad \zeta_{t} = \frac{\zeta_{0}}{(t+1)^{1/2-\epsilon/2}}, $$
where 0<ε<1. Furthermore, define β_{t} to be
$$ \beta_{t} = \rho_{t}^{2}\,\zeta_{t}^{2} = \frac{\beta_{0}}{t+1}, \qquad \text{where}\;\; \beta_{0}=\rho_{0}^{2}\,\zeta_{0}^{2}. $$
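The randomized activation schedule can be sketched as follows; ρ_0 = ζ_0 = 1 and ε = 0.3 are illustrative constants. The sketch also checks the useful identity that the expected link weight ρ_t²ζ_t² decays exactly as 1/(t+1), which is the consensus-weight decay β_t:

```python
import numpy as np

eps = 0.3
rho = lambda t: 1.0 / (t + 1) ** (eps / 2)          # link weight scale rho_t
zeta = lambda t: 1.0 / (t + 1) ** (0.5 - eps / 2)   # activation probability zeta_t

rng = np.random.default_rng(3)
# Empirical activation frequency of psi_{n,t} at t = 3 matches zeta_3.
draws = rng.random(20000) < zeta(3)
assert abs(draws.mean() - zeta(3)) < 0.015

# Expected link weight rho_t^2 * zeta_t^2 decays exactly as 1/(t+1) = beta_t
# (with beta_0 = rho_0^2 * zeta_0^2 = 1 here).
for t in [0, 9, 99]:
    assert np.isclose(rho(t) ** 2 * zeta(t) ** 2, 1.0 / (t + 1))
```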
With the above development in place, we define the random time-varying Laplacian L(t), where \(\mathbf {L}(t)\in \mathbb {R}^{N\times N}\), which abstracts the inter-node information exchange, as follows:
$$ L_{nl}(t) = \left\{\begin{array}{ll} -\rho_{t}^{2}\,\psi_{n,t}\,\psi_{l,t}, & (n,l)\in E,\; n\neq l\\ 0, & (n,l)\notin E,\; n\neq l\\ -\sum_{l\neq n} L_{nl}(t), & n=l. \end{array}\right. $$
The communication protocol (17)–(20) assumes that neighboring nodes communicate only when the corresponding communication link is bidirectional. We next discuss how bidirectional communication links can be enforced in practice. Let us first assume that there exists a dedicated, reliable, bidirectional communication link between any two neighboring nodes. Consider a link between nodes n and l at time t. If ψ_{n,t}=1, node n participates in communication, and it turns on both its transmitting and receiving antennas. If ψ_{n,t}=0, it switches off both its transmitting and receiving antennas. Suppose that ψ_{n,t}=1, and consider two scenarios: (1) ψ_{l,t}=0 and (2) ψ_{l,t}=1. Consider first the former case. Since node n listens to the dedicated channel from node l and node l does not transmit, node n verifies that it does not receive the respective message from node l (e.g., within a prescribed time window), and hence, it does not incorporate node l’s estimate in its update. Also, as ψ_{l,t}=0, node l does not include the estimate of node n, by algorithm construction. Next, consider the case ψ_{l,t}=1. In this case, node n listens to the channel and receives the message from node l, and thus, it incorporates node l’s estimate in its update. Completely symmetrically, node l listens to the channel from node n, receives the respective message, and includes node n’s estimate in its update. Overall, the preceding discussion explains how the symmetric communication protocol can be established. A very similar consideration applies if the links are unreliable but still symmetric, in the sense that if the link from n to l is strong enough to support communication, then so is the link from l to n. Finally, if the physical links can fail in an asymmetric fashion, then the proposed algorithm (see ahead (26)–(28)) cannot be implemented in its direct form. More precisely, asymmetrically failing links make the Laplacian matrices L(t) non-symmetric. The algorithm (26)–(28) and the corresponding analysis would need to change in such a scenario. This lies outside the scope of this paper, but it constitutes an interesting future research direction.
With the protocol described in (17)–(20), both the weight assigned to the links and the probability of the existence of a link decay over time. We next consider the first moment, the second moment, and the variance of the Laplacian entries for {i,j}∈E:
$$ \mathbb{E}\left[L_{ij}(t)\right] = -\rho_{t}^{2}\,\zeta_{t}^{2}, \qquad \mathbb{E}\left[L_{ij}(t)^{2}\right] = \rho_{t}^{4}\,\zeta_{t}^{2}, \qquad \operatorname{Var}\left(L_{ij}(t)\right) = \rho_{t}^{4}\,\zeta_{t}^{2}\left(1-\zeta_{t}^{2}\right). $$
For future reference, we also introduce the mean of the Laplacian matrices {L(t)} as \(\overline {\mathbf {L}}(t) = \mathbb {E}\left [\mathbf {L}(t)\right ]\), and the residual \(\widetilde {\mathbf {L}}(t) = \mathbf {L}(t)-\overline {\mathbf {L}}(t)\). Thus, it holds that \(\mathbb {E}\left [\widetilde {\mathbf {L}}(t)\right ] = \mathbf {0}\), and
$$ \mathbb{E}\left[\left\|\widetilde{\mathbf{L}}(t)\right\|^{2}\right] \leq C_{L}\,\rho_{t}^{4}\,\zeta_{t}^{2}\left(1-\zeta_{t}^{2}\right), $$
for a constant C_{L}>0 that depends only on the graph G,
where ∥·∥ denotes the L_{2} norm. Inequality (23) can be obtained as follows. First, we have that \(\left\|\widetilde {\mathbf {L}}(t)\right\| \leq \left\|\widetilde {\mathbf {L}}(t)\right\|_{F},\) where ∥·∥_{F} denotes the Frobenius norm. Also, note that
$$ \left\|\widetilde{\mathbf{L}}(t)\right\|_{F}^{2} = \sum_{i,j}\widetilde{L}_{ij}(t)^{2}. $$
Taking expectation and using (17), inequality (23) follows.
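The norm inequality invoked here is easy to check numerically; the random symmetric matrix below merely stands in for the residual Laplacian:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
M = M + M.T                                  # symmetric, like the residual Laplacian
spec = np.linalg.norm(M, 2)                  # L2 (spectral) norm
frob = np.linalg.norm(M, 'fro')              # Frobenius norm
# The spectral norm never exceeds the Frobenius norm.
assert spec <= frob + 1e-12
```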
Next, we also have that \(\overline {\mathbf {L}}(t)=\beta _{t}\overline {\mathbf {L}}\), where \(\overline{\mathbf{L}}\) is the (unweighted) Laplacian matrix of graph G, i.e., \(\overline{\mathbf{L}} = \mathbf{D}-\mathbf{A}\).
We next give an assumption on the connectivity of the inter-agent communication graph.
Assumption 5
The inter-agent communication graph is connected on average, i.e., \(\lambda _{2}(\overline {\mathbf {L}}) > 0\), which implies \(\lambda _{2}(\overline {\mathbf {L}}(t))>0\) for all t, where \(\overline {\mathbf {L}}(t)\) denotes the mean of the Laplacian matrix L(t) and λ_{2}(·) denotes the second smallest eigenvalue.
Assumption 5 ensures consistent information flow among the agent nodes. Technically speaking, the communication graph, modeled here as a random undirected graph, need not be connected at all times. It is to be noted that Assumption 5 ensures that the mean graph is connected at all times, as \(\overline {\mathbf {L}}(t)=\beta _{t}\overline {\mathbf {L}}\). We now state an additional assumption on the smoothness of the sensing functions for the distributed setup.
Assumption 6
For each n, the sensing function f_{n}(·) is Lipschitz continuous on Θ, i.e., for each agent n, there exists a constant k_{n}>0 such that
$$ \left\|\mathbf{f}_{n}({\boldsymbol{\theta}})-\mathbf{f}_{n}(\acute{{\boldsymbol{\theta}}})\right\| \leq k_{n}\left\|{\boldsymbol{\theta}}-\acute{{\boldsymbol{\theta}}}\right\|, $$
for all \({\boldsymbol{\theta}}, \acute{{\boldsymbol{\theta}}} \in \Theta\).
With the communication protocol established, we propose an update where every node n generates an estimate sequence {x_{n}(t)}, where \(\mathbf {x}_{n}(t)\in \mathbb {R}^{M}\), in the following way:
$$ \widetilde{\mathbf{x}}_{n}(t) = \mathbf{x}_{n}(t) - \rho_{t}^{2}\sum_{l\in\Omega_{n}}\psi_{n,t}\,\psi_{l,t}\left(\mathbf{x}_{n}(t)-\mathbf{x}_{l}(t)\right) + \alpha_{t}\,\nabla\mathbf{f}_{n}\left(\mathbf{x}_{n}(t)\right)\mathbf{R}_{n}^{-1}\left(\mathbf{y}_{n}(t)-\mathbf{f}_{n}\left(\mathbf{x}_{n}(t)\right)\right) $$
and
$$ \mathbf{x}_{n}(t+1) = \mathcal{P}_{\Theta}\left[\widetilde{\mathbf{x}}_{n}(t)\right], $$
where Ω_{n} denotes the neighborhood of node n with respect to the network represented by \(\overline {\mathbf {L}}\), α_{t} is the innovation gain sequence given by α_{t}=α_{0}/(t+1), α_{0}>0, and \(\mathcal {P}_{\Theta }[\cdot ]\) is the projection operator corresponding to projecting on Θ. The random variable ψ_{n,t} determines the activation state of a node n. By activation, we mean that if ψ_{n,t}≠0, then node n can send and receive information in its neighborhood at time t. However, when ψ_{n,t}=0, node n neither transmits nor receives information. The link between node n and node l gets assigned a weight of \(\rho _{t}^{2}\) if and only if ψ_{n,t}≠0 and ψ_{l,t}≠0.
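A toy simulation of an update of this form, with randomized link activation, may be useful; as before, the graph, sensing function, and constants are illustrative only (in particular, ρ_0 and the clipping interval standing in for Θ are chosen for numerical stability of this toy, not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true, N, T, eps = 1.0, 4, 5000, 0.3
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}  # 4-node ring
f = lambda th: th + 0.3 * np.sin(th)          # illustrative sensing function
df = lambda th: 1.0 + 0.3 * np.cos(th)        # its derivative
x = np.zeros(N)
comms = 0
for t in range(T):
    rho2 = 0.25 / (t + 1) ** eps              # link weight rho_t^2 (rho_0^2 = 0.25)
    zeta_t = 1.0 / (t + 1) ** (0.5 - eps / 2) # activation probability zeta_t
    alpha = 1.0 / (t + 1)                     # innovation weight
    psi = rng.random(N) < zeta_t              # node activation states psi_{n,t}
    comms += int(psi.sum())                   # transmissions this round
    y = f(theta_true) + 0.2 * rng.standard_normal(N)
    x_new = x.copy()
    for n in range(N):
        # A link is used only when BOTH of its endpoints are active.
        cons = sum(psi[n] * psi[l] * (x[n] - x[l]) for l in neighbors[n])
        innov = df(x[n]) * (y[n] - f(x[n]))
        x_new[n] = np.clip(x[n] - rho2 * cons + alpha * innov, -5.0, 5.0)
    x = x_new
assert np.all(np.abs(x - theta_true) < 0.1)   # estimates converge
assert comms < 0.2 * N * T                    # communications are sparse
```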
The update in (26) can be written compactly as follows:
$$ \mathbf{x}(t+1) = \mathcal{P}_{\Theta}^{N}\left[\mathbf{x}(t) - \left(\mathbf{L}(t)\otimes\mathbf{I}_{M}\right)\mathbf{x}(t) + \alpha_{t}\,\mathbf{G}\left(\mathbf{x}(t)\right)\left(\mathbf{y}(t)-\mathbf{f}\left(\mathbf{x}(t)\right)\right)\right]. $$
Here, ⊗ denotes the Kronecker product, I_{M} denotes the M×M identity matrix, \(\mathcal{P}_{\Theta}^{N}[\cdot]\) applies the projection \(\mathcal{P}_{\Theta}[\cdot]\) block-wise to each agent’s component, and:
$$ \mathbf{x}(t) = \left[\mathbf{x}_{1}(t)^{\top},\ldots,\mathbf{x}_{N}(t)^{\top}\right]^{\top}, \qquad \mathbf{y}(t) = \left[\mathbf{y}_{1}(t)^{\top},\ldots,\mathbf{y}_{N}(t)^{\top}\right]^{\top}, $$
$$ \mathbf{f}\left(\mathbf{x}(t)\right) = \left[\mathbf{f}_{1}\left(\mathbf{x}_{1}(t)\right)^{\top},\ldots,\mathbf{f}_{N}\left(\mathbf{x}_{N}(t)\right)^{\top}\right]^{\top}, \qquad \mathbf{G}\left(\mathbf{x}(t)\right) = \operatorname{blockdiag}\left\{\nabla\mathbf{f}_{1}\left(\mathbf{x}_{1}(t)\right)\mathbf{R}_{1}^{-1},\ldots,\nabla\mathbf{f}_{N}\left(\mathbf{x}_{N}(t)\right)\mathbf{R}_{N}^{-1}\right\}. $$
Remark 1
The Laplacian sequence that plays a role in the analysis in this paper takes the form \(L(t)=\beta _{t}\overline {L}+\widetilde {L}(t)\), where the residual Laplacian sequence \(\widetilde {L}(t)\) does not scale with β_{t}, owing to the fact that the communication rate is chosen adaptively; this makes the Laplacian matrix sequence not identically distributed.
We refer to the parameter estimate update in (26) and the projection in (27) in conjunction with the randomized communication protocol as the \(\mathcal {CREDONL}\) algorithm. We propose a condition on the sensing functions (standard in the literature of general recursive procedures) that guarantees the existence of stochastic Lyapunov functions and, hence, the convergence of the distributed estimation procedure.
Assumption 7
The following aggregate strict monotonicity condition holds: there exists a constant c_{1}>0 such that for each pair \({\boldsymbol {\theta }}, \acute {{\boldsymbol {\theta }}}\) in Θ we have that
$$ \left({\boldsymbol{\theta}}-\acute{{\boldsymbol{\theta}}}\right)^{\top}\sum_{n=1}^{N}\nabla\mathbf{f}_{n}({\boldsymbol{\theta}})\,\mathbf{R}_{n}^{-1}\left(\mathbf{f}_{n}({\boldsymbol{\theta}})-\mathbf{f}_{n}(\acute{{\boldsymbol{\theta}}})\right) \geq c_{1}\left\|{\boldsymbol{\theta}}-\acute{{\boldsymbol{\theta}}}\right\|^{2}. $$
The instrumental step in analyzing the convergence of the proposed algorithm is ensuring the existence of appropriate stochastic Lyapunov functions (see, for example [16–20]) which is in turn guaranteed by Assumption 7.
Remark 2
It is to be noted that Assumptions 6 and 7 are only sufficient conditions. Moreover, the assumptions that play a key role in establishing the main results, i.e., Assumptions 1, 2, 6, and 7, are required to hold only on the parameter set Θ instead of the entire space \(\mathbb {R}^{M}\), which makes our algorithm applicable to very general nonlinear sensing functions.
We consider a specific example to give more intuition about the assumptions in this paper. If the f_{n}(·)’s are linear, i.e., f_{n}(θ)=F_{n}θ, where F_{n} is the sensing matrix with dimensions M_{n}×M, Assumption 3 becomes equivalent to \(\sum _{n=1}^{N}\mathbf {F}_{n}^{\top }\mathbf {R}_{n}^{-1}\mathbf {F}_{n}\) being full rank.^{Footnote 4} In this context, the monotonicity condition in Assumption 7 is trivially satisfied by the positive definiteness of the matrix \(\sum _{n=1}^{N}\mathbf {F}_{n}^{\top }\mathbf {R}_{n}^{-1}\mathbf {F}_{n}\). We formalize an assumption on the innovation gain sequence {α_{t}} before proceeding further.
Assumption 8
We require that α_{0} satisfies
where c_{1} is defined in Assumption 7 and α_{0} is the innovation gain at t=0.
The communication cost per node for the proposed algorithm is given by \(\mathcal {C}_{t} = \sum _{s=0}^{t-1}\zeta _{s} = \Theta \left (t^{(1+\epsilon)/2}\right)\), which is strictly sublinear since ε<1.
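The stated cost scaling can be checked numerically: summing \(\zeta _{s} = \zeta _{0}/(s+1)^{(1-\epsilon)/2}\) up to t−1 grows like \(t^{(1+\epsilon)/2}\) (a sketch with illustrative values of ζ_0 and ε):

```python
def comm_cost(t, zeta0=1.0, eps=0.1):
    # C_t = sum_{s=0}^{t-1} zeta_s with zeta_s = zeta0 / (s+1)^{(1-eps)/2}
    return sum(zeta0 / (s + 1) ** ((1.0 - eps) / 2.0) for s in range(t))

# For large t, C_t / t^{(1+eps)/2} approaches the constant 2*zeta0/(1+eps)
# (integral comparison), and C_t / t -> 0, i.e., the cost is strictly sublinear.
```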
Main results
In this section, we present the main results for the proposed algorithm \(\mathcal {CREDONL}\); the proofs of the main results are relegated to Section 7. The first result concerns the consistency of the estimate sequence {x_{n}(t)}.
Theorem 4.1
Let Assumptions 1–3 and 5–8 hold. Consider the sequence {x_{n}(t)} generated by algorithm (26)–(27) at each agent n, with the parameters set to \(\rho _{t} = \frac {\rho _{0}}{(t+1)^{\epsilon /2}},\)\(\zeta _{t} = \frac {\zeta _{0}}{(t+1)^{(1/2-\epsilon /2)}},\) and α_{t}=α_{0}/(t+1), where ρ_{0},ζ_{0},α_{0} are arbitrary positive numbers. Then, for each n, we have
Theorem 4.1 verifies that the estimate sequence generated by \(\mathcal {CREDONL}\) at any agent n is strongly consistent, i.e., x_{n}(t)→θ almost surely (a.s.) as t→∞. While Assumption 4 is needed for asymptotic normality results as in Proposition 1, it is not necessary for Theorem 4.1 (nor for Theorem 4.2 ahead) to hold.
We now state a main result of this paper which establishes the MSE communication rate for the proposed algorithm \(\mathcal {CREDONL}\).
Theorem 4.2
Let the hypothesis of Theorem 4.1 hold. Then, we have, for each n,
Furthermore, for each n, we have:
where ε, 0<ε<1, is as defined in (18).
We make several remarks on Theorems 4.1 and 4.2.
Remark 3
Note that ε in Theorem 4.2 can be taken to be arbitrarily small. Hence, \(\mathcal {CREDONL}\) achieves an MSE communication rate arbitrarily close to \(1/\mathcal {C}_{t}^{2}\). This is a significant improvement over existing nonlinear distributed consensus + innovations estimation methods, e.g., [18, 20]: they have O(t) communication cost up to time t and an MSE iterationwise rate of O(1/t), hence achieving \(O(1/\mathcal {C}_{t})\) MSE communication rates. \(\mathcal {CREDONL}\) achieves the orderoptimal O(1/t) MSE iterationwise rate with a reduced communication cost, thus significantly improving the MSE communication rate.
Remark 4
Observe that the \(\mathcal {CREDONL}\) algorithm, with β_{t}=β_{0} (t+1)^{−1}, has communication cost \(\mathcal {C}_{t} = \Theta \left (t^{0.5(1+\epsilon)}\right)\). From this, we can see that the MSE as a function of \(\mathcal {C}_{t}\) is given by \(\text {MSE} = O\left (\mathcal {C}_{t}^{-2/(1+\epsilon)}\right)\).
Of course, with β_{t} that decays faster than 1/t, communication cost reduces further. However, it can be shown that in this case the algorithm no longer produces good estimates. Namely, from standard arguments in stochastic approximation, it can be shown that for β_{t}=β_{0} (t+1)^{−1−δ}, with δ>0, \(\mathcal {CREDONL}\)’s estimate sequence may not converge to θ.
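The rate claim in Remark 4 is a change of variables: with MSE = Θ(1/t) and \(\mathcal {C}_{t} = \Theta (t^{(1+\epsilon)/2})\), eliminating t gives MSE = Θ\((\mathcal {C}_{t}^{-2/(1+\epsilon)})\). A quick numeric sanity check (ignoring the hidden constants, which we set to 1 for illustration):

```python
def rates(t, eps):
    mse = 1.0 / t                      # order-optimal iteration-wise MSE, O(1/t)
    cost = t ** ((1.0 + eps) / 2.0)    # per-node communication cost C_t
    return mse, cost

# Eliminating t: mse equals cost ** (-2 / (1 + eps)) up to floating point,
# since cost^{-2/(1+eps)} = t^{-((1+eps)/2)*(2/(1+eps))} = t^{-1}.
```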
Remark 5
The \(\mathcal {CREDONL}\) algorithm builds on our prior work in [37, 38, 40], but establishing Theorems 4.1–4.2 incurs several technical challenges with respect to that work. From a technical standpoint, the \(\mathcal {CIWNLS}\) algorithm in [40] deals with the challenge of nonlinear observation models, while \(\mathcal {CREDO}\) in [37, 38] deals with the challenge of increasingly sparse communications. Differently from \(\mathcal {CREDO}\) and \(\mathcal {CIWNLS}\), this paper accounts for both of these challenges simultaneously, which makes the mean square and asymptotic normality analyses more demanding. As a consequence, while for \(\mathcal {CIWNLS}\) and \(\mathcal {CREDO}\) we established both the MSE iterationwise convergence rate and asymptotic normality, here we establish only the MSE (iterationwise and communicationwise) convergence rate results. Next, \(\mathcal {CREDONL}\) is a single time scale stochastic approximationtype algorithm, while both \(\mathcal {CIWNLS}\) and \(\mathcal {CREDO}\) are two time scale algorithms. Further, the consensus potentials in \(\mathcal {CIWNLS}\) and in \(\mathcal {CREDONL}\) coincide only on average, i.e., up to the first moment. The difference in higher order moments calls for a different analysis: the randomized communication protocol of \(\mathcal {CREDONL}\) incurs an increased upper bound on the iterationwise MSE. A careful analysis in this paper shows that the additional terms in the MSE bounds for \(\mathcal {CREDONL}\) decay faster with time t than 1/t, and hence the MSE iterationwise rate remains orderoptimal and equal to 1/t (see the proofs of Theorems 4.1 and 4.2 in Appendix A). Finally, we point out that the differences of Theorem 4.1 with respect to the works [37, 38] mainly arise from the fact that we consider here nonlinear observation models.
Due to this difference, several terms that appear in the MSE upper bounds are bounded in a technically different way; see the proof of Lemma A1 in Appendix A. Therein, we need arguments such as the nonexpansiveness property of projections and the Lipschitz continuity of the functions f_{n}, none of which is explicitly used in [37, 38].
Simulation experiments
This section corroborates our theoretical findings through simulation examples and demonstrates the communication efficiency of \(\mathcal {CREDONL}\).
Specifically, we compare the proposed communication efficient distributed estimator, \(\mathcal {CREDONL}\), with the benchmark distributed recursive estimator in (13) and the diffusion algorithm as in [43]^{Footnote 5}, which both utilize all interneighbor communications at all times, i.e., they have a linear communication cost. The example demonstrates that the proposed communication efficient estimator has a similar MSE iterationwise rate as the two benchmark estimators. The simulation also shows that the proposed estimator improves the MSE communication rate with respect to the two benchmarks.
We generate a random geometric network of 10 agents, shown in Fig. 1.
The relative degree^{Footnote 6} of the graph is equal to 0.4. The graph was generated as a connected instance of the geometric graph model with radius \(r=\sqrt {\text {ln}(N)/N}\). To be specific, we first generate 10 points uniformly at random in the unit square and connect two nodes with a link if the distance between them is less than \(\sqrt {\text {ln}(N)/N}\); we repeat the procedure until we obtain a connected graph instance. We choose the parameter set Θ to be \(\Theta =\left [-\frac {\pi }{4}, \frac {\pi }{4}\right ]^{7}\subset \mathbb {R}^{7}\). This choice of Θ conforms with Assumption 2. The sensing functions are chosen to be certain trigonometric functions as described below. The underlying parameter is set as θ=[θ_{1}, θ_{2}, θ_{3}, θ_{4}, θ_{5}, θ_{6}, θ_{7}] and thus \({\boldsymbol {\theta }}\in \mathbb {R}^{7}\). The sensing functions at the agents are taken to be f_{1}(θ)= sin(θ_{1}+θ_{2}+θ_{3}), f_{2}(θ)= sin(θ_{3}+θ_{2}+θ_{4}), f_{3}(θ)= sin(θ_{3}+θ_{4}+θ_{5}), f_{4}(θ)= sin(θ_{4}+θ_{5}+θ_{6}), f_{5}(θ)= sin(θ_{6}+θ_{5}+θ_{7}), f_{6}(θ)= sin(θ_{6}+θ_{7}+θ_{1}), f_{7}(θ)= sin(θ_{1}+θ_{2}+θ_{7}), f_{8}(θ)= sin(θ_{1}+θ_{2}+θ_{4}), f_{9}(θ)= sin(θ_{2}+θ_{3}+θ_{6}), and f_{10}(θ)= sin(θ_{3}+θ_{4}+θ_{6}). Thus, each node makes a scalar observation at time t. The noises γ_{n}(t) are Gaussian, i.i.d. both in time and across nodes, with covariance matrix equal to 0.25×I_{10}. The local sensing functions render the parameter θ locally unobservable, but θ is globally observable: under the parameter set Θ considered in this setup, sin(·) is onetoone, and the linear combinations of the θ components appearing in the arguments of the sin(·)’s constitute a full rank system for θ. Hence, the global observability requirement specified by Assumption 3 is satisfied. The unknown but deterministic value of the parameter is taken to be θ=[π/6, −π/7, π/12, −π/5, π/16, 7π/36, π/10].
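Global observability in this example reduces to the 0/1 matrix of sin(·) arguments having full column rank, which can be checked directly (a sketch; the index triples below transcribe the arguments of f_{1},…,f_{10}, and the function names are ours):

```python
import numpy as np

# 1-based indices of the theta components in the argument of each f_n
ARGS = [(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7),
        (1, 6, 7), (1, 2, 7), (1, 2, 4), (2, 3, 6), (3, 4, 6)]

def selection_matrix(args, m=7):
    """Row n has ones at the theta components entering the argument of f_n."""
    F = np.zeros((len(args), m))
    for row, triple in enumerate(args):
        for j in triple:
            F[row, j - 1] = 1.0
    return F

# Full column rank (= 7) means the linear combinations determine theta;
# combined with sin(.) being one-to-one on the chosen Theta, the parameter
# is globally observable even though no single node observes it locally.
```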
Under the model considered here, in terms of the sensing functions specified above and the parameter set \(\Theta =\left [-\frac {\pi }{4}, \frac {\pi }{4}\right ]^{7}\), it can be easily verified that the model conforms to the conditions specified in Assumptions 3–7. The projection operator \(\mathcal {P}_{\Theta }\) onto the set Θ defined in (14) is given by
for all i=1,⋯,M.
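For a box constraint set such as \(\Theta =[-\pi /4, \pi /4]^{7}\), the projection above is simply componentwise clipping, which can be written in a few lines (a minimal sketch):

```python
import math

def project_box(x, lo=-math.pi / 4, hi=math.pi / 4):
    """Componentwise projection onto Theta = [lo, hi]^M:
    (P_Theta[x])_i = min(max(x_i, lo), hi), for i = 1, ..., M."""
    return [min(max(xi, lo), hi) for xi in x]
```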
The parameters of the two benchmarks and of the proposed estimator are as follows. The benchmark estimator in (13) has the consensus weight set to 0.48(t+1)^{−1}. For the proposed estimator, we set ρ_{t}=0.45(t+1)^{−0.01} and ζ_{t}=(t+1)^{−0.49}. The step size sequence for the benchmark estimator proposed in [43] is set to μ_{t}=(0.3(t+20))^{−1}.
It is to be noted that the Laplacian matrix considered for the benchmark estimator and the expected Laplacian matrix for the proposed estimator \(\mathcal {CREDONL}\) are equal, i.e., \(\mathbf {\overline {L}} =\mathbf {L}\). The innovation weight is set to α_{t}=(0.3(t+20))^{−1}. We note that, with the time shifted innovation potential, the theoretical results in this paper continue to hold. As a performance metric, we use the relative MSE estimate averaged across nodes:
further averaged across 100 independent runs of the estimators. In the above equation, x_{n}(0) refers to the initial estimate at each node, which is set as x_{n}(0)=0. Figure 2 plots the relative MSE decay in terms of the number of iterations, or equivalently the number of samples. It can be seen that the MSE decay of the two benchmark estimators and that of the proposed estimator \(\mathcal {CREDONL}\) are very similar with respect to the iteration count. Figure 3 plots the MSE decay of the three estimators in terms of the communication cost per node. It can be seen, for example, that at a relative MSE level of 10^{−1}, the proposed estimator requires 20 and 18 times fewer communications than the estimator in (13) and the algorithm in [43], respectively. One can also notice a faster MSE decay in terms of the communication cost for \(\mathcal {CREDONL}\) as compared to the benchmark (13), thus confirming our theory.
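The performance metric above can be sketched in code as follows (our reading of the metric: node-averaged squared error normalized by the initial squared error; variable names are ours):

```python
import numpy as np

def relative_mse(x_t, x_0, theta):
    """Relative MSE averaged across nodes:
    (1/N) * sum_n ||x_n(t) - theta||^2 / ||x_n(0) - theta||^2.
    x_t, x_0: (N, M) stacked estimates; theta: (M,) true parameter."""
    num = np.sum((x_t - theta) ** 2, axis=1)
    den = np.sum((x_0 - theta) ** 2, axis=1)
    return float(np.mean(num / den))
```

In the experiments, this quantity would additionally be averaged across the independent Monte Carlo runs.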
Discussion
In the context of existing work on nonlinear distributed methods, e.g., [15, 16, 31, 34–36, 40], the current paper contributes by developing a method with a strictly faster communication rate of \(O(1/\mathcal {C}_{t}^{2-\zeta })\) (ζ>0 arbitrarily small) with respect to existing \(O(1/\mathcal {C}_{t})\) rates. Further, with respect to existing works that develop methods designed to achieve communication efficiency, e.g., [13, 21, 44–46], we develop here a different scheme with randomized increasingly sparse communications. Finally, this paper is a continuation of works [37, 38] but, in contrast with [37, 38], it considers nonlinear observation models. This requires novel analysis techniques as detailed in Section 1. It would be interesting to apply the proposed method on real data sets, e.g., in the context of IoT or power systems applications, in addition to the synthetic data tests considered here.
Conclusions
In this paper, we have proposed \(\mathcal {CREDONL}\)—a communicationefficient distributed estimation scheme for nonlinear observation models. We established strong consistency of the estimate sequence at each agent and characterized the MSE decay in terms of the peragent communication cost \(\mathcal {C}_{t}\). \(\mathcal {CREDONL}\) achieves the MSE decay rate \(O\left (\mathcal {C}_{t}^{-2+\zeta }\right)\), where ζ>0 is arbitrarily small. Future research directions include extending the proposed algorithm to a mixedtime scale stochastic approximation type algorithm, so as to achieve an asymptotic covariance independent of the network, as well as to extend the presented ideas to distributed stochastic optimization.
Appendix A: Proof of Main Results
We present the proofs of main results in this section.
Proof of Theorem 4.1
We start the proof with the following useful lemma.
Lemma 1
For each n, the process {x_{n}(t)} satisfies
Proof
Consider (14). Since the projection is onto a compact convex set, it is nonexpansive. It follows that the inequality
holds for all n and t. We first note that,
where \(\mathbb {E}\left [\widetilde {\mathbf {L}}(t)\right ] = \mathbf {0}\) and \(\mathbb {E}\left [\widetilde {\mathbf {L}}_{i,j}^{2}(t)\right ] = \frac {\rho _{0}^{2}\beta _{0}}{(t+1)^{1+\epsilon }} - \frac {\beta _{0}^{2}}{(t+1)^{2}}\), for {i,j}∈E, i≠j.
Define z(t)=x(t)−1_{N}⊗θ and V(t)=∥z(t)∥^{2}. (Here, 1_{N} is the all-ones N×1 vector.) Note that z(t) corresponds to the estimation error vector at time t; its squared norm V(t) will first serve as a Lyapunov function to establish the almost sure boundedness of x(t) as in Lemma A1. Let \(\{\mathcal {F}_{t}\}\) be the natural filtration generated by the random observations and the random Laplacians, i.e.,
Now, consider the update rules (26)–(28). By algebraic manipulations, conditional independence, and utilizing (36), we have that,
Consider the orthogonal decomposition
where z_{C} denotes the projection of z to the consensus subspace \(\mathcal {C}=\left \{\mathbf {z} \in \mathbb {R}^{MN}: \mathbf {z}=\mathbf {1}_{N}\otimes \mathbf {a}, \text {\ for\ some\ } \mathbf {a} \in \mathbb {R}^{M} \right \}\). The following inequalities hold for all t≥t_{1}, where t_{1} is a sufficiently large positive integer:
Here, we recall that \(\lambda _{N}(\mathbf {\overline {L}})\) is the largest eigenvalue of the matrix \(\mathbf {\overline {L}}\). Further, c_{1} is defined in Assumption 7, and c_{2},c_{5} are appropriately chosen positive constants. Here, z_{C⊥}(t)=z(t)−z_{C}(t), where z_{C}(t) is the projection of z(t) on the consensus subspace \(\mathcal C\). Inequality (q0) holds because, as noted above, \(\mathbb {E}\left [\widetilde {\mathbf {L}}_{i,j}^{2}(t)\right ] \leq \frac {\rho _{0}^{2}\beta _{0}}{(t+1)^{1+\epsilon }} \), for {i,j}∈E, i≠j. Specifically, the constant c_{5} can be taken to equal \(2 \,N^{3}\,\rho _{0}^{2}\beta _{0}\). Next, inequalities (q1) and (q3) follow from the properties of the Laplacian. Inequality (q2) follows from Assumption 7, and (q4) follows from Assumption 6: since ∥∇f_{n}(x_{n}(t))∥ is uniformly bounded from above by k_{n} for all n, we have that \(\|G(\mathbf {x}(t))\|\leq \max _{n=1,\cdots,N}k_{n}\). (Recall the quantity G(x(t)) defined before Remark 3.1.) That is, c_{2} can be taken as \((\max _{n=1,\cdots,N}k_{n})^{2}\, (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}^{-1}\|)\, \|\overline {\mathbf {L}}\|\). We also have
for some constant c_{4}>0. In (42), we use the fact that the noise process under consideration has finite covariance. We also use the fact that, almost surely, \(\|G(\mathbf {x}(t))\|\leq \max _{n=1,\cdots,N}k_{n}\), which in turn follows from Assumption 6. In particular, c_{4} may be taken as \((\max _{n=1,\cdots,N}k_{n})^{2}\, (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}^{-1}\|)^{2}\, (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}\|)^{2}\). We further have that,
where c_{3}>0 is a constant. It is to be noted that (43) follows from the Lipschitz continuity in Assumption 6 and the bound \(\|G(\mathbf {x}(t))\|\leq \max _{n=1,\cdots,N}k_{n}\). That is, c_{3} may be taken as \((\max _{n=1,\cdots,N}k_{n})^{4}\, (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}^{-1}\|)^{2}\). Applying the bounds (41)–(43) in (39), we obtain, after some algebraic manipulations,
where c_{6},c_{8},c_{9} are appropriately chosen positive constants, and c_{5} is as in (41). In particular, c_{6} may be taken as c_{6}=c_{4}; c_{8} may be taken as \(\beta _{0}^{2}\,(\lambda _{N}(\overline {\mathbf {L}}))^{2} /\alpha _{0}^{2} + 2\beta _{0}\sqrt {c_{3}}+c_{3}\), and c_{9} may be taken as \(2\, \lambda _{2}(\overline {\mathbf {L}})\).
As \(\frac {c_{5}}{(t+1)^{1+\epsilon }}\) goes to zero faster than β_{t}, ∃t_{2} such that ∀t≥t_{2}, \(\beta _{t} \ge \frac {c_{5}}{(t+1)^{1+\epsilon }}\). By the above construction we obtain ∀t≥t_{2},
where \(\widehat {\alpha }(t) = \sqrt {c_{6}}\alpha _{t}\). The product \(\prod _{s=t}^{\infty }(1+\alpha _{s}^{2})\) exists for all t. Now, let {W(t)} be such that
By (46), it can be shown that {W(t)} satisfies,
Hence, {W(t)} is a nonnegative supermartingale and converges a.s. to a bounded random variable W^{∗} as t→∞. It then follows from (46) that V(t)→W^{∗} as t→∞. Thus, we conclude that the desired claim holds. □
The following Lemma will play a key role in establishing the convergence of the estimate sequence.
Lemma 2
(Lemma 4.1 in [18]) Consider the scalar timevarying linear system
where {r_{1}(t)} is a sequence, such that
with a_{1}>0,0≤δ_{1}<1, whereas the sequence {r_{2}(t)} is given by
with a_{2}>0,δ_{2}≥0. Then, if u(0)≥0 and δ_{1}<δ_{2}, we have
for all 0≤δ_{0}<δ_{2}−δ_{1}. Also, if δ_{1}=δ_{2}, then the sequence {u(t)} stays bounded, i.e., \(\sup _{t\geq 0}\|u(t)\|<\infty \).
We now prove the almost sure convergence of the estimate sequence to the true parameter. Following similar steps as in the proof of Lemma 1, for t large enough
as for t large enough, \(-2c_{1}\alpha _{t}+c_{7}\alpha ^{2}_{t}<0\). Here, the constant c_{6} is as in (44), and c_{7} is an appropriately chosen positive constant that may be taken as \(\beta _{0}^{2}\,(\lambda _{N}(\overline {\mathbf {L}}))^{2} /\alpha _{0}^{2} + 2\beta _{0}\sqrt {c_{3}}+c_{3}\). Now, consider the \(\{\mathcal {F}_{t}\}\)adapted process {V_{1}(t)} defined as follows
Since {(t+1)^{−2}} is summable, the process {V_{1}(t)} is bounded from above. Moreover, it also follows that \(\phantom {\dot {i}\!}\{V_{1}(t)\}_{t\geq t_{1}}\) is a supermartingale and hence converges a.s. to a finite random variable. By definition from (53), we also have that {V(t)} converges to a nonnegative finite random variable V^{∗}. Finally, from (52), we have that,
for t large enough. The sequence {V(t)} then falls under the purview of Lemma 3 ahead, and we have \({\mathbb {E}}[\!V(t)]\to 0\) as t→∞. Finally, by Fatou’s Lemma, where we use the nonnegativity of the sequence {V(t)}, we conclude that
which thus implies that V^{∗}=0 a.s. Hence, ∥z(t)∥→0 a.s. as t→∞, and the desired assertion follows.
We will use the following approximation result (Lemma 3) and the generalized convergence criterion (Lemma 4) for the proof of Theorem 4.2. Lemma 3 is an extension of Lemma 5 in [18]. Lemma 4 is Lemma 10 in [8].
Lemma 3
Let {b_{t}} be a scalar sequence satisfying
where d>0 and c>1. Then, we have,
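The displayed recursion and conclusion of Lemma 3 did not survive extraction here; under the standard form of this lemma (our reading, following its counterpart in [18]), b_{t+1} ≤ (1 − c/(t+1)) b_t + d/(t+1)^2 with c>1 forces b_t = O(1/t). This can be sanity-checked by iterating the equality version of the recursion, where balancing terms suggests t·b_t → d/(c−1):

```python
def iterate(b0, c, d, T):
    """Iterate b_{t+1} = (1 - c/(t+1)) * b_t + d/(t+1)**2 for T steps.
    This is the equality version of our reading of the (missing) display;
    treat it as an illustrative assumption, not the paper's exact statement."""
    b = b0
    for t in range(T):
        b = (1.0 - c / (t + 1)) * b + d / (t + 1) ** 2
    return b
```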
Lemma 4
Let {J(t)} be an \(\mathbb {R}\)valued \(\{\mathcal {F}_{t+1}\}\)adapted process such that \(\mathbb {E}\left [J(t)\mid \mathcal {F}_{t}\right ]=0\) a.s. for each t≥1. Then the sum \(\sum _{t\geq 0}J(t)\) exists and is finite a.s. on the set where \(\sum _{t\geq 0}\mathbb {E}\left [J(t)^{2}\mid \mathcal {F}_{t}\right ]\) is finite.
Proof of Theorem 4.2
Consider inequality (54), and recall that, by Assumption 8, we have that α_{0} c_{1}>1. We can now see that the sequence {V(t)} then falls under the purview of Lemma 3, and we have
Inequality (58) now clearly implies that, for each agent n, there holds:
The communication cost \(\mathcal {C}_{t}\) for the proposed \(\mathcal {CREDONL}\) algorithm is given by \(\mathcal {C}_{t} = \Theta \left (t^{\frac {\epsilon +1}{2}}\right)\), and thus the assertion follows in conjunction with (59). □
Notes
 1.
From now on, in order to better distinguish the MSE rate of decay with respect to the number of iterations t and with respect to the number of pernode communications, we will refer to the former as the MSE iterationwise rate and to the latter as the MSE communication rate.
 2.
The stronger requirement imposed here, with ε_{1} being strictly positive, is only required for the benchmark estimator in Eqs. (13)–(14) ahead to be defined properly; the reason for this requirement is the two time scale nature of the benchmark estimator (13)–(14). As the proposed \(\mathcal {CREDONL}\) estimator is single time scale, ε_{1} can be taken to be zero, and the main results (Theorems 4.1 and 4.2 ahead) continue to hold.
 3.
To see this, note that the dependence of the measurements on the state is through sinusoidal functions (see Eq. (4)), which are everywhere differentiable and thus the gradient of f_{n}(·) within the domain Θ exists everywhere. Moreover, as the derivatives of sin(·) and cos(·) are bounded, the norm of gradient of f_{n}(·) is bounded. Finally, regarding Assumption 3, it can be shown that the assumption is satisfied if (1) graph G is connected; (2) the set of admissible phase angle values, i.e., the parameter constraint set Θ, is chosen appropriately; (3) the real power flow between nodes n and l is nonzero if and only if there exists a physical transmission line connecting the nodes; and (4) voltage magnitude \(\mathcal {V}_{n} \neq 0\), for all nodes n. Please see Proposition 27 in [19].
 4.
To see why this is true, consider for simplicity the case R_{n}=I, for all n. Then, there holds: \(\sum _{n=1}^{N} \|\mathbf {f}_{n}({\boldsymbol {\theta }}) - \mathbf {f}_{n}({\boldsymbol {\theta }}^{\prime })\|^{2} = ({\boldsymbol {\theta }}-{\boldsymbol {\theta }}^{\prime })^{\top } \left (\sum _{n=1}^{N} \mathbf {F}_{n}^{\top } \mathbf {F}_{n} \right) ({\boldsymbol {\theta }}-{\boldsymbol {\theta }}^{\prime }).\) Now, the statement of Assumption 3 becomes the following: the matrices F_{n}, n=1,...,N, are such that \(({\boldsymbol {\theta }}-{\boldsymbol {\theta }}^{\prime })^{\top } \left (\sum _{n=1}^{N} \mathbf {F}_{n}^{\top } \mathbf {F}_{n} \right) ({\boldsymbol {\theta }}-{\boldsymbol {\theta }}^{\prime })=0 \) if and only if θ=θ^{′}. But this is equivalent to requiring that \(\sum _{n=1}^{N} \mathbf {F}_{n}^{\top } \mathbf {F}_{n} \) be full rank.
 5.
Applied to our setting and in our notation, the diffusion method as in [43] takes the following form:
$$\begin{array}{*{20}l} &{}\mathbf{x}^{\prime}_{n}(t+1)=\mathbf{x}_{n}(t)-\mu_{t}\left(\nabla \mathbf{f}_{n}(\mathbf{x}_{n}(t))\right)\mathbf{R}_{n}^{-1}\left(\mathbf{f}_{n}(\mathbf{x}_{n}(t))-\mathbf{y}_{n}(t)\right)\\ &{}\mathbf{x}_{n}(t+1) = \sum_{l \in \Omega_{n} \cup \{n\}} a_{ln}\,\mathbf{x}^{\prime}_{l}(t+1). \end{array} $$Here, x_{n}(t) is the solution estimate at agent n, \(\mathbf {x}^{\prime }_{n}(t)\) is an auxiliary sequence at agent n, μ_{t} is the stepsize, and the a_{ln}’s are combination weights that together constitute an N×N column stochastic matrix.
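As a sketch in code, one adapt-then-combine iteration of the diffusion display above can be written as follows (the evaluation of the gradient-weighted residual is left to the caller; the function and variable names are ours):

```python
import numpy as np

def diffusion_step(x, grad_resid, A, mu_t):
    """One diffusion iteration, matching the two-step display above.
    x:          (N, M) stacked estimates x_n(t)
    grad_resid: (N, M) rows holding grad f_n(x_n(t))^T R_n^{-1} (f_n(x_n(t)) - y_n(t))
    A:          (N, N) combination weights, entry [l, n] = a_{ln}
    """
    x_prime = x - mu_t * grad_resid   # local adaptation (innovation) step
    return A.T @ x_prime              # combine: x_n(t+1) = sum_l a_{ln} x'_l(t+1)
```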
 6.
Relative degree is the ratio of the number of links in the graph to the number of possible links in the graph.
Abbreviations
 CPS:

Cyberphysical systems
 \(\mathcal {CREDO}\) :

Communication efficient REcursive Distributed estimatOr
 \(\mathcal {CREDONL}\) :

Nonlinear \(\mathcal {CREDO}\)
 i.i.d.:

Independent identically distributed
 IoT:

Internet of things
References
 1
A. Abur, A. G. Exposito, Power System State Estimation: Theory and Implementation (Marcel Dekker, New York, 2004).
 2
D. Bajović, J. M. F. Moura, J. Xavier, B. Sinopoli, Distributed inference over directed networks: performance limits and optimal design. IEEE Trans. Sig. Process. 64(13), 3308–3323 (2016).
 3
B. Bollobas, Modern Graph Theory (Springer Verlag, New York, 1998).
 4
P. Braca, S. Marano, V. Matta, Enforcing consensus while monitoring the environment in wireless sensor networks. IEEE Trans. Sig. Process. 56(7), 3375–3380 (2008).
 5
F. Cattivelli, A. H. Sayed, Diffusion LMS strategies for distributed estimation. IEEE Trans. Sig. Process. 58(3), 1035–1048 (2010).
 6
J. Chen, C. Richard, A. H. Sayed, Multitask diffusion adaptation over networks. IEEE Trans. Sig. Process. 62(16), 4129–4144 (2014).
 7
F. R. K. Chung, Spectral graph theory, vol. 92 (American Mathematical Soc., Providence, 1997).
 8
L. E. Dubins, D. A. Freedman, A sharper form of the BorelCantelli lemma and the strong law. Ann. Math. Stat. 36(3), 800–807 (1965).
 9
V. Fabian, On asymptotically efficient recursive estimation. Ann. Stat. 6(4), 854–866 (1978).
 10
R. Z. Has’minskij, in Proc. Prague Symp. Asymptotic Statist. Sequential estimation and recursive asymptotically optimal procedures of estimation and observation control, vol. 1 (Charles Univ., Prague, 1974), pp. 157–178.
 11
C. Heinze, B. McWilliams, N. Meinshausen, in 19th International Conference on Artificial Intelligence and Statistics. Dualloco: distributing statistical estimation using random projections (Cadiz, 2016), pp. 875–883.
 12
M. D. Ilic’, J. Zaborszky, Dynamics and Control of Large Electric Power Systems (Wiley, New York, 2000).
 13
D. Jakovetic, D. Bajovic, N. Krejic, N. Krklec Jerinkic, Distributed gradient methods with variable number of working nodes. IEEE Trans. Sig. Process. 64(15), 4080–4095 (2016).
 14
D. Jakovetic, J. Xavier, J. M. F. Moura, Cooperative convex optimization in networked systems: augmented Lagrangian algorithms with directed gossip communication. IEEE Trans. Sig. Process. 59(8), 3889–3902 (2011).
 15
R. I. Jennrich, Asymptotic properties of nonlinear least squares estimators. Ann. Math. Stat. 40(2), 633–643 (1969).
 16
S. Kar, J. M. F. Moura, Asymptotically efficient distributed estimation with exponential family statistics. IEEE Trans. Inf. Theory. 60(8), 4811–4831 (2014).
 17
S. Kar, J. M. F. Moura, H. V. Poor, Distributed linear parameter estimation: asymptotically efficient adaptive strategies. SIAM J. Control Optim. 51(3), 2200–2229 (2013).
 18
S. Kar, J. M. F. Moura, H. V. Poor, QDLearning: A Collaborative Distributed Strategy for MultiAgent Reinforcement Learning Through Consensus + Innovations. IEEE Trans. Signal Process. 61(7), 1848–1862 (2013).
 19
S. Kar, J. M. F. Moura, K. Ramanan, Distributed parameter estimation in sensor networks: nonlinear observation models and imperfect communication. IEEE Trans. Inf. Theory. 58(6), 3575–3605 (2012).
 20
S. Kar, J. M. F. Moura, Convergence rate analysis of distributed gossip (linear parameter) estimation: fundamental limits and tradeoffs. IEEE J. Sel. Top. Sig. Process. 5(4), 674–690 (2011).
 21
G. Lan, S. Lee, Y. Zhou, Communicationefficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961 (2017).
 22
J. Li, A. H. Sayed, Modeling bee swarming behavior through diffusion adaptation with asymmetric information sharing. EURASIP J. Adv. Sig. Process. 2012, Article no. 18 (2012).
 23
Q. Liu, A. T. Ihler, in NIPS’14 Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Distributed estimation, information loss and exponential families (MIT Press, Cambridge, 2014), pp. 1098–1106.
 24
C. G. Lopes, A. H. Sayed, Diffusion leastmean squares over adaptive networks: formulation and performance analysis. IEEE Trans. Sig. Process. 56(7), 3122–3136 (2008).
 25
P. D. Lorenzo, A. H. Sayed, Sparse distributed learning based on diffusion adaptation. IEEE Trans. Sig. Process. 61(6), 1419–1433 (2013).
 26
C. Ma, M. Takáč, Partitioning data on features or samples in communication-efficient distributed optimization? arXiv preprint arXiv:1510.06688 (2015).
 27
C. Ma, V. Smith, M. Jaggi, M. Jordan, P. Richtarik, M. Takac, in ICML’15 Proceedings of the 32nd International Conference on International Conference on Machine Learning  Volume 37. Adding vs. averaging in distributed primaldual optimization (Lille, 2015), pp. 1973–1982.
 28
G. Mateos, G. B. Giannakis, Distributed recursive leastsquares: stability and performance analysis. IEEE Trans. Sig. Process. 60(7), 3740–3754 (2012).
 29
G. Mateos, I. Schizas, G. B. Giannakis, Performance analysis of the consensus-based distributed LMS algorithm. EURASIP J. Adv. Sig. Process. 2009, Article no. 68 (2009).
 30
A. Nedić, A. Olshevsky, C. Uribe, in 2015 American Control Conference (ACC). Nonasymptotic convergence rates for cooperative learning over timevarying directed graphs (IEEE, Chicago, 2015). https://doi.org/10.1109/ACC.2015.7172262.
 31
A. Nedic, A. Ozdaglar, Distributed subgradient methods for multiagent optimization. IEEE Trans. Autom. Control. 54(1), 48–61 (2009).
 32
J. Pfanzagl, in Proceedings of the Prague Symposium on Asymptotic Statistics, ed. by J. Hajek. Asymptotic optimum estimation and test procedures, vol. 1 (Charles University, Prague, 1974), pp. 201–272.
 33
A. Primadianto, C. N. Lu, A review on distribution system state estimation. IEEE Trans. Power Syst. 32(5), 3875–3883 (2017).
 34
S. S. Ram, A. Nedic, V. V. Veeravalli, Incremental stochastic subgradient algorithms for convex optimization. SIAM J. Optim. 20(2), 691–717 (2009).
 35
S. S. Ram, A. Nedić, V. V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization. J. Optim. Theory Appl. 147(3), 516–545 (2010).
 36
S. S. Ram, V. V. Veeravalli, A Nedic, Distributed and recursive parameter estimation in parametrized linear statespace models. IEEE Trans. Autom. Control. 55(2), 488–492 (2010).
 37
A. K. Sahu, D. Jakovetic, S Kar, Communication optimality tradeoffs for distributed estimation. arXiv preprint arXiv:1801.04050 (2018).
 38
A. K. Sahu, D. Jakovetic, S. Kar, in International Symposium on Information Theory, ISIT. CREDO: A communicationefficient distributed estimation algorithm (Vail, 2018).
 39
A. K. Sahu, S. Kar, Distributed sequential detection for Gaussian shiftinmean hypothesis testing. IEEE Trans. Sig. Process. 64(1), 89–103 (2016).
 40
A. K. Sahu, S. Kar, J. M. F. Moura, H. V. Poor, Distributed constrained recursive nonlinear leastsquares estimation: algorithms and asymptotics. IEEE Trans. Sig. Inf. Process. Over Networks. 2(4), 426–441 (2016).
 41
D. J. Sakrison, Efficient recursive estimation; application to estimating the parameters of a covariance function. Int. J. Eng. Sci. 3(4), 461–483 (1965).
 42
C. J. Stone, Adaptive maximum likelihood estimators of a location parameter. Ann. Stat. 3(2), 267–284 (1975).
 43
Z. J. Towfic, J. Chen, A. H. Sayed, Excessrisk of distributed stochastic learners. IEEE Trans. Inf. Theory. 62(10), 5753–5785 (2016).
 44
K. Tsianos, S. Lawlor, M. G. Rabbat, in NIPS’12 Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2. Communication/computation tradeoffs in consensusbased distributed optimization (Curran Associates Inc., Lake Tahoe, 2012), pp. 1943–1951.
 45
K. I. Tsianos, S. F. Lawlor, J. Y. Yu, M. G. Rabbat, in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. Networked optimization with adaptive communication (IEEEAustin, 2013), pp. 579–582.
 46
Z. Wang, Z. Yu, Q. Ling, D. Berberidis, G. B. Giannakis, Decentralized RLS with data-adaptive censoring for regressions over large-scale networks. IEEE Trans. Sig. Process. 66(6) (2018).
 47
Y. Zhang, J. Duchi, M. Wainwright, in Proceedings of the 26th Annual Conference on Learning Theory, PMLR. Vol. 30. Divide and conquer kernel ridge regression (Princeton, 2013), pp. 592–617.
 48
F. Zhao, L. J. Guibas, Wireless Sensor Networks: An Information Processing Approach (Morgan Kaufmann, San Francisco, 2004).
Funding
This work is supported by the IBiDaaS project, funded by the European Commission under Grant Agreement No. 780787. This publication reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein. The work of D. Jakovetic is also supported in part by the Serbian Ministry of Education, Science, and Technological Development, grant 174030. The work is also partially supported by the National Science Foundation under grant CCF1513936.
Availability of data and materials
The data used in this paper is synthetic and is generated as described in Section 5 of the paper. Please contact authors for data requests.
Author information
Affiliations
Contributions
AKS lead the writing of Sections 2–5 and Appendix, he also lead carrying out theoretical analysis, and he carried out numerical experiments in Section 5. He also contributed in writing Sections 1, 6, and 7. DJ lead the writing of Sections 1, 6, and 7. He also contributed in writing Sections 2–5 and Appendix and in developing the code for carrying out numerical results in Section 5. DB contributed in writing Sections 1–4. SK contributed in writing Sections 1–3 and Appendix. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Dusan Jakovetic.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Received
Accepted
Published
DOI
Keywords
 Distributed estimation
 Stochastic approximation
 Statistical inference
 Nonlinear least squares