Privatized graph federated learning

Rizk, Elsa; Vlaski, Stefan; Sayed, Ali H.

doi:10.1186/s13634-023-01049-4

Research
Open access
Published: 25 August 2023


EURASIP Journal on Advances in Signal Processing volume 2023, Article number: 87 (2023)


Abstract

Federated learning is a semi-distributed algorithm, where a server communicates with multiple dispersed clients to learn a global model. The federated architecture is not robust and is sensitive to communication and computational overloads due to its one-master multi-client structure. It can also be subject to privacy attacks targeting personal information on the communication links. In this work, we introduce graph federated learning, which consists of multiple federated units connected by a graph. We then show how graph-homomorphic perturbations can be used to ensure the algorithm is differentially private on the server level. While on the client level, we show that improvement in the differentially private federated learning algorithm can be attained through the addition of random noise to the updates, as opposed to the models. We conduct both convergence and privacy theoretical analyses and illustrate performance by means of computer simulations.

1 Introduction

Federated learning (FL) [1] is one particular distributed structure where users no longer need to send their data to a server for training. Instead, data remains local, and training happens in collaboration between different clients and the server. Compared to a fully decentralized solution, communication occurs between the server and the clients (or agents), instead of directly between the agents themselves. Such a solution is advantageous in the sense that users no longer need to worry about sharing their data with an unknown party, and the high cost of sending all their raw data is eliminated. In this way, the data stays locally safe on a user’s device, and no extra communication cost is incurred for transferring the data remotely. However, such a distributed architecture is not robust to communication failures and computational overloads, nor is it immune to privacy attacks when agents are required to share their local updates. In standard FL, millions of users can be connected to one server at a time. This means one server is responsible for communicating with all clients and carries a significant computational burden, which renders the system susceptible to communication failures. Furthermore, whether clients send their gradient updates or their local models, information about their data can be inferred from the exchanges and leaked [2,3,4,5]. Consider for instance the logistic risk: the gradient of the loss function is a scalar multiple of the feature vector. Thus, even though the actual data samples are not sent to the server, information about them can still be inferred from the gradient updates or the models.
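To make the logistic-risk remark concrete, the following minimal Python sketch evaluates the gradient of the logistic loss $\ln (1+e^{-\gamma h^{\textsf{T}} w})$ for a single sample and checks that it is parallel to the feature vector $h$. The symbols $h$ and $\gamma$, the numerical values, and the function name are illustrative assumptions that do not appear in the original text.

```python
import math

def logistic_loss_gradient(w, h, gamma):
    """Gradient of ln(1 + exp(-gamma * h^T w)) with respect to w."""
    margin = gamma * sum(wj * hj for wj, hj in zip(w, h))
    scale = -gamma / (1.0 + math.exp(margin))   # scalar depending on w and the sample
    return [scale * hj for hj in h]

h = [0.5, -1.0, 2.0]     # private feature vector (hypothetical values)
gamma = 1                # private label in {-1, +1}
w = [0.1, 0.2, -0.3]     # current model

grad = logistic_loss_gradient(w, h, gamma)
# Every entry of the gradient divided by the matching feature gives the same scalar,
# so an observer of the gradient learns the direction of h.
ratios = [g / hj for g, hj in zip(grad, h)]
print(ratios)
```

Since all entries of `ratios` coincide, the shared gradient reveals the feature vector up to a scalar, which is the leakage the paragraph above refers to.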

These considerations motivate us to propose an architecture for federated learning with privacy guarantees. In particular, we introduce the graph federated architecture, which consists of multiple servers, and we privatize the algorithm by ensuring the communication occurring between the servers and the clients is secure. Graph-homomorphic perturbations, which were initially introduced in [6], focus on the communication between servers. They are based on adding correlated noise to the messages sent between servers such that the noise cancels out if we were to take the average of all messages across all servers. As for the privatization between the clients and their servers, we share noisy updates as opposed to models. The two protocols make sure the effect of the added noise is reduced.

Other works have also contributed to addressing the same challenges we are considering in this work, albeit differently. For example, the work [7] introduces a hierarchical architecture, where it is assumed there are multiple servers connected in a tree structure. Such a solution still has one main server and thus faces the same robustness problem as FL. The graph federated learning (GFL) architecture in this work (which appeared in the earlier conference publication [8]) is a more general structure. The work [9] generalizes the standard distributed learning framework to include local updates. The work [10] has a similar architecture to the GFL architecture proposed earlier in [8]; nevertheless, it does not deal with privacy, and it employs different objective functions and a different learning algorithm based on the alternating direction method of multipliers. Likewise, a plethora of solutions exist that relate to privacy issues. These methods may be split into two sub-groups: those using random perturbations to ensure a certain level of differential privacy [11,12,13,14,15,16,17,18,19,20], and those that rely on cryptographic methods [21,22,23,24,25]. Both have their advantages and disadvantages. While differential privacy is easy to implement, it hinders the performance of the algorithm by reducing the model utility. As for cryptographic methods, they are generally harder to implement since they require more computational and communication power [26, 27]. Furthermore, they restrict the number of participating users. In what follows, we focus on differentially private methods.

The main contribution in this work is three-fold. We introduce a new generalized and more realistic architecture for the federated setting where we now consider multiple servers connected by some graph structure. Furthermore, many earlier works have proposed adding Laplacian noise sources to the shared information among agents in order to ensure some level of privacy. However, these works have largely ignored the fact that these noises degrade the mean-square error (MSE) performance of the network from $O(\mu )$ to $O(\mu ^{-1})$, where $\mu$ is the small step-size parameter. To resolve this issue, we define a new noise generation scheme that maintains the MSE at O(1) while ensuring privacy. Although the work [20] proposed a noisy-distributed consensus strategy, this reference lacks a useful construction method for the perturbations. In this work, we devise a construction scheme. Therefore, the main difference between our proposed method and previous works is that we devise a noise construction scheme that ensures the total sum of the added noise cancels out centrally. This results in the improved MSE bound of O(1). Finally, we prove that clients sharing noisy updates as opposed to noisy models leads to improved performance relative to what is commonly done in the prior literature. Moreover, we do not assume bounded gradients, as commonly assumed in previous works [12, 15, 16], since this condition does not actually hold in most situations in practice. Note, for instance, that even quadratic risks do not have bounded gradients. For this reason, we will not rely on this condition, and will instead be able to show that our noise construction is able to ensure differential privacy with high probability for most cases of interest. The main results shown in this work are as follows:

1. Privatized GFL under graph-homomorphic perturbations converges in the MSE sense to an O(1) neighbourhood of the true model $w^o$ as opposed to $O(\mu ^{-1})$ when random perturbations are used instead.
2. Privatized FL under perturbed gradients converges in the MSE sense to an $O(\mu )$ neighbourhood of the true model $w^o$ as opposed to $O(\mu ^{-1})$ when perturbed models are shared instead.
3. GFL with graph-homomorphic perturbations and perturbed gradients is $\epsilon (i)$-differentially private with high probability.

2 Graph federated architecture

In the graph federated architecture, which we initially introduced in [8], we consider P federated units connected by a graph structure. Each federated unit consists of a server and a set of K agents. Thus, the overall architecture can be represented as a graph depicted in Fig. 1. We denote the combination matrix connecting the servers by $A \in {\mathbb {R}}^{P\times P }$, and we write $a_{mp}$ to refer to the elements of A. We assume each agent of every server has its own dataset $\{x_{p,k,n}\}_{n=1}^{N_{p,k}}$ that is non-iid when compared to the other agents. The subscript p refers to the federated unit, k to the agent, and n to the data sample. We note the difference between our proposed architecture and a fully distributed setting. The graph federated architecture consists of a network of federated units while a fully distributed network removes the need for servers and assumes clients are connected to each other based on some graph structure. Such an architecture is an improvement on the original federated architecture and not necessarily on the fully distributed architecture. Instead of clients communicating with the same server, we split the load among multiple servers.

With this architecture, we associate a convex optimization problem that will take into account the cost function at each federated unit. Thus, the optimization goal is to find the optimal global model $w^o$ that minimizes an average empirical risk:

$$\begin{aligned} w^o \,\overset{\Delta }{=}\,\mathop {\textrm{argmin}}\limits _{w\in {\mathbb {R}}^M} \frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K J_{p,k}(w), \end{aligned}$$

(1)

where each individual cost is an empirical risk defined over the local loss functions $Q_{p,k}(\cdot ;\cdot )$:

$$\begin{aligned} J_{p,k} (w) \,\overset{\Delta }{=}\,\frac{1}{N_{p,k}} \sum _{n=1}^{N_{p,k}} Q_{p,k}(w;x_{p,k,n}). \end{aligned}$$

(2)

To solve problem (1), each federated unit p runs the standard federated averaging (FedAvg) algorithm [1]. An iteration i of the algorithm consists of the server p selecting a subset ${\mathcal {L}}_{p,i}$ of L participating agents. Then, in parallel, each agent runs a series of stochastic gradient descent (SGD) steps. We call these local steps epochs, and denote an epoch by the letter e and the total number of epochs by $E_{p,k}$. The index of the data point sampled at agent k in federated unit p during the $e$th epoch of iteration i is denoted by b. Thus, during an iteration i, each participating agent $k \in {\mathcal {L}}_{p,i}$ starts from the last server model ${\varvec{w}}_{p,i-1}$ and sends its new model ${\varvec{w}}_{p,k,E_{p,k}}$ to the server after $E_{p,k}$ epochs. During a single epoch e, the agent updates its current local model ${\varvec{w}}_{p,k,e-1}$ by running a single SGD step. Thus, an agent repeats the following adaptation step for $e=1,2,\ldots , E_{p,k}$:

$$\begin{aligned} {\varvec{w}}_{p,k,e} =&\,{\varvec{w}}_{p,k,e-1} - \frac{ \mu }{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}), \end{aligned}$$

(3)

where ${\varvec{x}}_{p,k,b}$ is the sampled data point of agent k in federated unit p, and ${\varvec{w}}_{p,k,0} = {\varvec{w}}_{p,i-1}$. After all the participating agents $k \in {\mathcal {L}}_{p,i}$ run all their epochs, the server aggregates their final models ${\varvec{w}}_{p,k,E_{p,k}}$, which we rename as ${\varvec{w}}_{p,k,i}$ since it is the final local model at iteration i:

$$\begin{aligned} {\varvec{\psi }}_{p,i} = \frac{1}{L}\sum _{k\in {\mathcal {L}}_{p,i}} {\varvec{w}}_{p,k,i}. \end{aligned}$$

(4)

Next, at the server level, these estimates are combined across neighbourhoods using a diffusion type strategy, where we first consider the previous steps (3) and (4) as the adaptation step and the following step as the combination step:

$$\begin{aligned} {\varvec{w}}_{p,i} = \sum _{m\in {\mathcal {N}}_p}a_{pm}{\varvec{\psi }}_{m,i}. \end{aligned}$$

(5)

To introduce privacy, the models communicated at each round between the agents and the servers need to be protected in some way. We could either apply secure multiparty computation (SMC) tools, like secret sharing, or use differential privacy. We focus on differential privacy or masking tools that can be represented by added noise. Thus, we let agent 1 in federated unit 2 add a noise component ${\varvec{g}}_{2,1,i}$ to its final model ${\varvec{w}}_{2,1,i}$ at iteration i, and then let server 2 add ${\varvec{g}}_{12,i}$ to the message ${\varvec{\psi }}_{2,i}$ it sends to server 1. More generally, we denote by ${\varvec{g}}_{pm,i}$ the noise added to the message sent by server m to server p at iteration i. Similarly, we denote by ${\varvec{g}}_{p,k,i}$ the noise added to the model sent by agent k to server p during the ith iteration. We use unseparated subscripts pm for the inter-server noise components to point out their ability to be combined into a matrix structure. In contrast, the subscripts of the agent-server noise components are separated by commas to highlight a hierarchical structure. Thus, the privatized algorithm can be written as a client update step (6), a server aggregation step (7), and a server combination step (8):

$$\begin{aligned} {\varvec{w}}_{p,k,i}&= {\varvec{w}}_{p,i-1} - \frac{\mu }{E_{p,k}} \sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}), \end{aligned}$$

(6)

$$\begin{aligned} {\varvec{\psi }}_{p,i}&= \frac{1}{L}\sum _{k \in {\mathcal {L}}_{p,i}} \left( {\varvec{w}}_{p,k,i} + {\varvec{g}}_{p,k,i}\right) , \end{aligned}$$

(7)

$$\begin{aligned} {\varvec{w}}_{p,i}&= \sum _{m\in {\mathcal {N}}_p} a_{pm}({\varvec{\psi }}_{m,i} + {\varvec{g}}_{pm,i}). \end{aligned}$$

(8)

The client update step (6) follows from (3) by combining the multiple epochs for $e=1,2,\ldots , E_{p,k}$ into one update step, with ${\varvec{w}}_{p,k,i} = {\varvec{w}}_{p,k,E_{p,k}}$ and ${\varvec{w}}_{p,k,0} = {\varvec{w}}_{p,i-1}$, namely:

$$\begin{aligned} {\varvec{w}}_{p,k,E_{p,k}}&= {\varvec{w}}_{p,k,E_{p,k}-1} - \frac{\mu }{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,E_{p,k}-1};{\varvec{x}}_{p,k,b}) \\&= {\varvec{w}}_{p,k,E_{p,k}-2} -\frac{\mu }{E_{p,k}} \sum _{e=E_{p,k}-1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}) \\&= {\varvec{w}}_{p,k,0} -\frac{\mu }{E_{p,k}}\sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}). \end{aligned}$$

(9)
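The three steps (6)–(8) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: models are scalars, the loss is the quadratic $Q(w;x) = \frac{1}{2}(w-x)^2$, the noises are independent Laplace samples (not the correlated construction discussed later in the paper), neighbours are taken to be the servers with $a_{pm} > 0$, and all function names are hypothetical.

```python
import random

def laplace(scale, rng):
    """Zero-mean Laplace sample (difference of two exponentials); 0 if scale is 0."""
    if scale == 0.0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def client_update(w_server, data, mu, epochs, rng):
    """Step (6): `epochs` local SGD steps, each with step-size mu / epochs."""
    w = w_server
    for _ in range(epochs):
        x = rng.choice(data)              # sampled data point x_{p,k,b}
        w -= (mu / epochs) * (w - x)      # gradient of the assumed loss 0.5 * (w - x)^2
    return w

def gfl_iteration(w_prev, datasets, A, mu, epochs, L, client_scale, server_scale, rng):
    """One pass through (6)-(8) for scalar models; returns the new server models."""
    P = len(w_prev)
    psi = []
    for p in range(P):
        chosen = rng.sample(range(len(datasets[p])), L)   # participating agents
        noisy = [client_update(w_prev[p], datasets[p][k], mu, epochs, rng)
                 + laplace(client_scale, rng) for k in chosen]
        psi.append(sum(noisy) / L)                        # aggregation step (7)
    # Combination step (8): neighbours are the servers m with a_pm > 0.
    return [sum(A[p][m] * (psi[m] + laplace(server_scale, rng))
                for m in range(P) if A[p][m] > 0)
            for p in range(P)]

# Noise-free run on a toy problem: P = 3 units, K = 2 agents, one sample per agent.
rng = random.Random(0)
A = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
datasets = [[[1.0], [2.0]], [[3.0], [4.0]], [[5.0], [6.0]]]
w = [0.0, 0.0, 0.0]
for _ in range(300):
    w = gfl_iteration(w, datasets, A, mu=0.1, epochs=2, L=2,
                      client_scale=0.0, server_scale=0.0, rng=rng)
centroid = sum(w) / len(w)
print(w, centroid)
```

With both noise scales set to zero and every agent participating, the centroid of this toy run approaches the minimizer of the average risk, which is the data mean 3.5. Setting `client_scale` or `server_scale` to a positive value adds the perturbations of (7) and (8).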

3 Performance analysis

In this section, we show a list of results on the performance of the algorithm. We study the convergence of the privatized algorithm (6)–(8), and examine the effect of privatization on performance.

3.1 Modeling conditions

To go forward with our analysis, we require certain reasonable assumptions on the graph structure and cost functions.

Assumption 1

(Combination matrix) The combination matrix A describing the graph is symmetric and doubly-stochastic, i.e.:

$$\begin{aligned} a_{pm} = a_{mp}, \quad \sum _{m=1}^P a_{mp} = 1. \end{aligned}$$

(10)

Furthermore, the graph is strongly-connected and A satisfies:

$$\begin{aligned} \iota _2 \,\overset{\Delta }{=}\,\rho \left( A -\frac{1}{P}\mathbbm {1}\mathbbm {1}^\textsf{T}\right) < 1. \end{aligned}$$

(11)

$\square$
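As a quick sanity check of Assumption 1 on a given combination matrix, one can test symmetry and stochasticity directly and estimate $\iota _2$ numerically. The sketch below uses plain power iteration, which is a choice made here for self-containment (it assumes the start vector is not orthogonal to the dominant eigenvector); the example matrix and the function names are hypothetical.

```python
def mixing_rate(A, iterations=200):
    """Estimate iota_2 = rho(A - (1/P) 1 1^T) by power iteration (A symmetric)."""
    P = len(A)
    B = [[A[p][m] - 1.0 / P for m in range(P)] for p in range(P)]
    v = [float(p + 1) for p in range(P)]       # arbitrary non-constant start vector
    rate = 0.0
    for _ in range(iterations):
        u = [sum(B[p][m] * v[m] for m in range(P)) for p in range(P)]
        norm_u = sum(x * x for x in u) ** 0.5
        norm_v = sum(x * x for x in v) ** 0.5
        if norm_u == 0.0:
            return 0.0
        rate = norm_u / norm_v                 # converges to the largest |eigenvalue| of B
        v = [x / norm_u for x in u]
    return rate

def satisfies_assumption_1(A, tol=1e-12):
    """Check nonnegativity, symmetry, unit row sums, and iota_2 < 1."""
    P = len(A)
    nonnegative = all(a >= 0.0 for row in A for a in row)
    symmetric = all(abs(A[p][m] - A[m][p]) <= tol for p in range(P) for m in range(P))
    stochastic = all(abs(sum(row) - 1.0) <= tol for row in A)   # symmetry gives column sums too
    return nonnegative and symmetric and stochastic and mixing_rate(A) < 1.0

A_example = [[0.5, 0.25, 0.25],
             [0.25, 0.5, 0.25],
             [0.25, 0.25, 0.5]]
print(satisfies_assumption_1(A_example), mixing_rate(A_example))
```

For this fully connected example the non-unit eigenvalues of $A$ both equal 0.25, so the estimated mixing rate is 0.25.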

Assumption 2

(Convexity and smoothness) The empirical risks $J_{p,k}(\cdot )$ are $\nu$-strongly convex, and the loss functions $Q_{p,k}(\cdot ;\cdot )$ are convex, namely, for some $\nu > 0$:

$$\begin{aligned} J_{p,k}(w_2)&\ge J_{p,k}(w_1) + \nabla _{w^{\textsf{T}}} J_{p,k}(w_1)(w_2-w_1) + \frac{\nu }{2}\Vert w_2 - w_1 \Vert ^2, \end{aligned}$$

(12)

$$\begin{aligned} Q_{p,k}(w_2;\cdot )&\ge Q_{p,k}(w_1;\cdot ) + \nabla _{w^{\textsf{T}}} Q_{p,k}(w_1;\cdot ) (w_2 - w_1). \end{aligned}$$

(13)

Furthermore, the loss functions have $\delta$-Lipschitz continuous gradients, meaning there exists $\delta >0$ such that for any data point $x_{p,k,n}$:

$$\begin{aligned} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w_2;x_{p,k,n}) - \nabla _{w^{\textsf{T}}} Q_{p,k}(w_1;x_{p,k,n})\Vert \le \delta \Vert w_2 - w_1\Vert . \end{aligned}$$

(14)

$\square$

We also require a bound on the difference between the global optimal model $w^o$ and the local optimal models $w^o_{p,k}$ that optimize $J_{p,k}(\cdot )$. This assumption is used to bound the gradient noise and the incremental noise defined further ahead. It is not a restrictive assumption; it simply imposes a condition on when collaboration among different agents is sensible. Since the agents have non-iid data, their optimal models are sometimes too different, and collaboration would then hurt their individual performance. For example, in recommender systems, people in the same country are more likely to be recommended the same movie than people in different countries. In other words, users within one country might have different models that are nevertheless relatively close, whereas the models of users in different countries can be far apart.

Assumption 3

(Model drifts) The distance of each local model $w_{p,k}^o$ to the global model $w^o$ is uniformly bounded, i.e., there exists $\xi \ge 0$ such that $\Vert w^o - w_{p,k}^o\Vert \le \xi$.

3.2 Network centroid convergence

We study the convergence of the algorithm from the perspective of the network centroid ${\varvec{w}}_{c,i}$:

$$\begin{aligned} {\varvec{w}}_{c,i} \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P {\varvec{w}}_{p,i}. \end{aligned}$$

(15)

We write the central recursion as:

$$\begin{aligned} {\varvec{w}}_{c,i}&= {\varvec{w}}_{c,i-1} - \mu \frac{1}{PL}\sum _{p=1}^P \sum _{k\in {\mathcal {L}}_{p,i}} \frac{1}{E_{p,k}} \sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k} ({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}) \\&\quad + \frac{1}{PL} \sum _{p=1}^P \sum _{k\in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i}+ \frac{1}{P}\sum _{p,m = 1}^P a_{pm} {\varvec{g}}_{pm,i}. \end{aligned}$$

(16)
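The passage from the combination step to the central recursion (16) relies on the columns of $A$ summing to one, so that averaging over the servers after combination gives the same result as averaging before it. A minimal numerical illustration, with a hypothetical matrix and scalar values, is:

```python
# Doubly-stochastic, symmetric combination matrix (hypothetical example).
A = [[0.6, 0.4, 0.0],
     [0.4, 0.2, 0.4],
     [0.0, 0.4, 0.6]]
psi = [1.0, -2.0, 4.0]        # aggregated scalar models psi_{m,i} (hypothetical values)
P = len(A)

# Noise-free combination step: w_p = sum_m a_pm * psi_m.
w = [sum(A[p][m] * psi[m] for m in range(P)) for p in range(P)]

centroid_before = sum(psi) / P    # average of the aggregated models
centroid_after = sum(w) / P       # network centroid after combination
print(centroid_before, centroid_after)
```

Both averages coincide, which is why the combination weights $a_{pm}$ do not appear in the gradient term of (16) and only survive in the inter-server noise term.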

Next, we define the model error as ${\widetilde{{\varvec{w}}}}_{c,i} \,\overset{\Delta }{=}\,w^o - {\varvec{w}}_{c,i}$ and the average gradient noise:

$$\begin{aligned} {\varvec{s}}_i \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P {\varvec{s}}_{p,i}, \end{aligned}$$

(17)

with the per-unit gradient noise ${\varvec{s}}_{p,i}$:

$$\begin{aligned} {\varvec{s}}_{p,i} \,\overset{\Delta }{=}\,\widehat{\nabla _{w^{\textsf{T}}} J_{p}}({\varvec{w}}_{p,i-1}) - \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}), \end{aligned}$$

(18)

and

$$\begin{aligned} \widehat{\nabla _{w^{\textsf{T}}} J_p}(\cdot )&\,\overset{\Delta }{=}\,\frac{1}{L} \sum _{k\in {\mathcal {L}}_{p,i}} \frac{1}{E_{p,k}} \sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}(\cdot ;{\varvec{x}}_{p,k,b}). \end{aligned}$$

(19)

We introduce the average incremental noise ${\varvec{q}}_i$ and the local incremental noise ${\varvec{q}}_{p,i}$, which capture the error introduced by the multiple local update steps:

$$\begin{aligned} {\varvec{q}}_i&\,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P {\varvec{q}}_{p,i}, \end{aligned}$$

(20)

$$\begin{aligned} {\varvec{q}}_{p,i}&\,\overset{\Delta }{=}\,\frac{1}{L} \sum _{k \in {\mathcal {L}}_{p,i}} \frac{1}{E_{p,k}} \sum _{e=1}^{E_{p,k}} \Big ( \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1}; {\varvec{x}}_{p,k,b}) - \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,i-1}; {\varvec{x}}_{p,k,b})\Big ). \end{aligned}$$

(21)

We then arrive at the following error recursion:

$$\begin{aligned} {\widetilde{{\varvec{w}}}}_{c,i} = {\widetilde{{\varvec{w}}}}_{c,i-1} + \mu \frac{1}{P}\sum _{p=1}^P \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + \mu {\varvec{s}}_i + \mu {\varvec{q}}_i- {\varvec{g}}_{i}, \end{aligned}$$

(22)

where ${\varvec{g}}_{i}$ is the total added noise at iteration i:

$$\begin{aligned} {\varvec{g}}_{i} \,\overset{\Delta }{=}\,\frac{1}{PL} \sum _{p=1}^P \sum _{k \in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i} + \frac{1}{P}\sum _{p,m=1}^P a_{pm}{\varvec{g}}_{pm,i} \end{aligned}$$

(23)

We estimate the first and second-order moments of the gradient noise in the following lemma. To do so, we use the fact, shown in previous work (Lemma 1 in [28]), that the individual gradient noise is zero-mean with a bounded second order moment:

$$\begin{aligned} {\mathbb {E}}\left\{ \Vert {\varvec{s}}_{p,i} \Vert ^2 | {\mathcal {F}}_{i-1}\right\} \le \beta _{s,p}^2 \Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + \sigma _{s,p}^2, \end{aligned}$$

(24)

where the constants are defined as:

$$\begin{aligned} \beta _{s,p}^2&\,\overset{\Delta }{=}\,\frac{6\delta ^2}{L} \left( 1 + \frac{1}{K}\sum _{k=1}^K \frac{1}{E_{p,k}}\right) , \end{aligned}$$

(25)

$$\begin{aligned} \sigma _{s,p}^2&\,\overset{\Delta }{=}\,\frac{1}{LK}\sum _{k=1}^K \left( \frac{12}{E_{p,k}} + 3\right) \frac{1}{N_{p,k}}\sum _{n=1}^{N_{p,k}} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o;x_{p,k,n})\Vert ^2, \end{aligned}$$

(26)

and ${\mathcal {F}}_{i-1}$ is the filtration defined over the randomness introduced by all the past subsampling of the data for the calculation of the stochastic gradient. Using Assumption 3, we can guarantee that $\sigma _{s,p}^2$ is bounded by bounding:

$$\begin{aligned} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o; x_{p,k,n})\Vert ^2 \le 2\Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2 + 2\delta ^2 \xi ^2. \end{aligned}$$

(27)
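For completeness, one way to arrive at (27), sketched here under Assumptions 2 and 3, is to add and subtract the gradient evaluated at the local minimizer $w^o_{p,k}$, apply the inequality $\Vert a+b\Vert ^2 \le 2\Vert a\Vert ^2 + 2\Vert b\Vert ^2$, and then use the Lipschitz condition (14) followed by the model drift bound:

$$\begin{aligned} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o; x_{p,k,n})\Vert ^2&\le 2\Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2 + 2\Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o; x_{p,k,n}) - \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2 \\&\le 2\Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2 + 2\delta ^2 \Vert w^o - w^o_{p,k}\Vert ^2 \\&\le 2\Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2 + 2\delta ^2 \xi ^2. \end{aligned}$$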

Lemma 1

(Estimation of first and second-order moments of the gradient noise) The gradient noise defined in (17) is zero-mean and has a bounded second-order moment:

$$\begin{aligned} {\mathbb {E}}\left\{ \Vert {\varvec{s}}_i \Vert ^2 | {\mathcal {F}}_{i-1} \right\}&\le \beta _s^2 \Vert {\widetilde{{\varvec{w}}}}_{c,i-1}\Vert ^2 + \sigma _s^2 + \frac{2}{P}\sum _{p=1}^P \beta _{s,p}^2 \Vert {\varvec{w}}_{p,i-1} - {\varvec{w}}_{c,i-1}\Vert ^2 \end{aligned}$$

(28)

where the constants $\beta _s^2$ and $\sigma _s^2$ are given by:

$$\begin{aligned} \beta _s^2&\,\overset{\Delta }{=}\,\frac{2}{P}\sum _{p=1}^P \beta _{s,p}^2, \quad \sigma _s^2 \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P\sigma _{s,p}^2. \end{aligned}$$

(29)

Proof

The above result follows from applying Jensen’s inequality and the bounds on the per-unit gradient noise ${\varvec{s}}_{p,i}$. $\square$

The new term appearing in the bound on the gradient noise is what we call the network disagreement:

$$\begin{aligned} \frac{1}{P} \sum _{p=1}^P \Vert {\varvec{w}}_{p,i} - {\varvec{w}}_{c,i}\Vert ^2. \end{aligned}$$

(30)

It captures the difference in the path taken by the individual models versus the network centroid. We bound this difference in Lemma 3. However, before doing so, we show that the second order moment of the incremental noise is on the order of $O(\mu )$. From Lemma 5 in [28], we can bound the individual incremental noise:

$$\begin{aligned} {\mathbb {E}} \Vert {\varvec{q}}_{p,i}\Vert ^2 \le&a \mu ^2 {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + a \mu ^2\xi ^2 + \frac{1}{K}\sum _{k=1}^K (b_k\mu ^4 + c_k \mu ^2)\sigma _{q,p,k}^2, \end{aligned}$$

(31)

where the constants are given by:

$$\begin{aligned} a&\,\overset{\Delta }{=}\,\frac{4\delta ^2}{K}\sum _{k=1}^K \frac{(E_{p,k}+1)(1-\lambda )-1+\lambda ^{E_{p,k}+1}}{E_{p,k}^2(1-\lambda )^2}, \end{aligned}$$

(32)

$$\begin{aligned} b_k&\,\overset{\Delta }{=}\,\frac{2E_{p,k}(E_{p,k}+1)(1-\lambda )^2 - 4E_{p,k}(1-\lambda ) + 4\lambda }{E_{p,k}^2 (1-\lambda )^3 } -\frac{ 2\lambda ^{E_{p,k}+1}}{E_{p,k}^2 (1-\lambda )^3},\end{aligned}$$

(33)

$$\begin{aligned} c_k&\,\overset{\Delta }{=}\,\frac{E_{p,k}-1}{3E_{p,k}}, \end{aligned}$$

(34)

$$\begin{aligned} \lambda&\,\overset{\Delta }{=}\,1-2\nu \mu + 4\delta ^2\mu ^2, \end{aligned}$$

(35)

$$\begin{aligned} \sigma ^2_{q,p,k}&\,\overset{\Delta }{=}\,3\sum _{n=1}^{N_{p,k}} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2. \end{aligned}$$

(36)

The following result follows.

Lemma 2

(Estimation of second-order moment of the incremental noise) The incremental noise defined in (20) has a bounded second-order moment:

$$\begin{aligned} {\mathbb {E}} \Vert {\varvec{q}}_i\Vert ^2&\le O(\mu ) {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i-1}\Vert ^2 + O(\mu )\xi ^2 + O(\mu ^2 )\sigma _{q}^2 \\&\quad + \frac{O(\mu )}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1} - {\varvec{w}}_{c,i-1}\Vert ^2, \end{aligned}$$

(37)

where the constant $\sigma _q^2$ is the average of $\sigma _{q,p,k}^2$:

$$\begin{aligned} \sigma _{q}^2 \,\overset{\Delta }{=}\,\frac{1}{PK}\sum _{p=1}^P\sum _{k=1}^K (b_k\mu ^4 + c_k \mu ^2)\sigma _{q,p,k}^2. \end{aligned}$$

(38)

Proof

The above result follows from applying Jensen’s inequality and the bounds on the per-unit incremental noise ${\varvec{q}}_{p,i}$. Furthermore, $a = O(\mu ^{-1})$, $b_k = O(\mu ^{-1})$, and $c_k = O(1)$ reduce the expression to (37). $\square$

We now bound the network disagreement. To do so, we first introduce the eigendecomposition of $A = QH Q^\textsf{T}$:

$$\begin{aligned} Q \,\overset{\Delta }{=}\,\begin{bmatrix} \frac{1}{\sqrt{P}}\mathbbm {1}&Q_{\theta } \end{bmatrix}, \quad H \,\overset{\Delta }{=}\,\begin{bmatrix} 1 &{} 0 \\ 0 &{} H_{\theta } \end{bmatrix}, \end{aligned}$$

(39)

where $H_{\theta }$ is a diagonal matrix that includes the last $(P-1)$ eigenvalues of A and $Q_{\theta }$ their corresponding eigenvectors.

Lemma 3

(Network disagreement) The average deviation from the centroid is bounded during each iteration i:

$$\begin{aligned} \frac{1}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i} - {\varvec{w}}_{c,i}\Vert ^2&\le \frac{ \iota _2^i }{P} {\mathbb {E}} \Vert (Q_{\epsilon } \otimes I){\varvec{{ {\mathcal {W}}}}}_0\Vert ^2 + \frac{\iota _2^2 }{P}\sum _{j'=0}^{i-1}\iota _2^{j'}\sum _{p=1}^P \Bigg \{ \mu ^2\bigg (\frac{2\delta ^2}{\iota _2(1-\iota _2) } \\&\quad +\beta _{s,p}^2 + O(\mu ) \bigg ) \bigg ( \lambda _p^{j'} A^{j'}[p] \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} _{p=1}^P + \sum _{j=0}^{j'-1} \lambda _p^j \\&\quad \times A^j[p]\text{ col }\left\{ \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2\right\} _{p=1}^P \bigg ) \\&\quad + \mu ^2\frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\iota _2(1-\iota _2)}+ \mu ^2\sigma _{s,p}^2 + O(\mu ^3)\xi ^2 + O(\mu ^4)\sigma _{q,p}^2 \\&\quad + \frac{1}{\iota _2^2} \sigma _{g,p}^2\Bigg \}, \end{aligned}$$

(40)

where ${\varvec{{ {\mathcal {W}}}}}_{0} \,\overset{\Delta }{=}\,\text{ col }\left\{ {\varvec{w}}_{p,0}\right\} _{p=1}^P$ and $\lambda _p \,\overset{\Delta }{=}\,\sqrt{1-2\nu \mu + \delta ^2\mu ^2} + \beta _{s,p}^2 \mu ^2 + O(\mu ^2) \in (0,1)$. Then, in the limit:

$$\begin{aligned} \limsup _{i\rightarrow \infty } \frac{1}{P}\sum _{p=1}^P {\mathbb {E}} \Vert {\varvec{w}}_{p,i} -{\varvec{w}}_{c,i}\Vert ^2 \le&\frac{\iota _2^2}{P(1-\iota _2)} \sum _{p=1}^P \mu ^2 \sigma _{s,p}^2 + \frac{1}{\iota _2^2}\sigma _{g,p}^2 + O(\mu )\sigma _{g,p}^2 + O(\mu ^3). \end{aligned}$$

(41)

Proof

See “Appendix 2”. $\square$

Thus, from the above lemma, we see that the individual models gravitate to the centroid model with an error introduced by the added privatization. The effect of the added noise overpowers that of the gradient and incremental noise, since the latter are on the order of the step-size.

Then, using the above result, we can establish the convergence of the centroid model to a neighbourhood of the true optimal model $w^o$ in the mean-square-error (MSE) sense.

Theorem 1

(Centroid MSE convergence) Under Assumptions 1, 2 and 3, the network centroid converges to the optimal point $w^o$ exponentially fast for a sufficiently small step-size $\mu$:

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \lambda _c {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i-1} \Vert ^2 +\mu ^2 \sigma _s^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 + {\mathbb {E}} \Vert {\varvec{g}}_{i}\Vert ^2 \\&\quad + \frac{O(\mu )}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1}-{\varvec{w}}_{c,i-1}\Vert ^2, \end{aligned}$$

(42)

where $\lambda _c = \sqrt{1-2\nu \mu + \delta ^2\mu ^2} +\beta _s^2\mu ^2 + O(\mu ^{2}) \in (0,1)$. Then, letting i tend to infinity, we get:

$$\begin{aligned} \limsup _{i\rightarrow \infty } {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \frac{\mu ^2 \sigma _s^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 + {\mathbb {E}}\Vert {\varvec{g}}\Vert ^2}{1-\lambda _c} + \sum _{p=1}^PO(1 ) \sigma _{g,p}^2+ O(\mu ). \end{aligned}$$

(43)

Proof

See “Appendix 3”. $\square$

The main term in the above bound is the variance of the added noise with a dominating factor of $\mu ^{-1}$, since:

$$\begin{aligned} 1- \lambda _c&= 1- \sqrt{1-O(\mu ) + O(\mu ^2)} - O(\mu ^2)= O(\mu )- O(\mu ^2) = O(\mu ) \end{aligned}$$

(44)

which allows us to rewrite the bound as follows:

$$\begin{aligned} \limsup _{i \rightarrow \infty } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le O(\mu )\sigma _s^2 + O(\mu )\xi ^2 + O(\mu ^2)\sigma _q^2 + O(\mu ^{-1}){\mathbb {E}}\Vert {\varvec{g}}\Vert ^2 \\&\quad + \sum _{p=1}^P O(1)\sigma _{g,p}^2 + O(\mu ), \end{aligned}$$

(45)

with ${\mathbb {E}}\Vert {\varvec{g}}\Vert ^2$ representing the variance of the total added noise, independent of time. While in general decreasing the step-size improves performance, the above result shows that this need not be the case with privatization. Thus, since the added noise impacts the model utility negatively, it is important to choose a privatization scheme that reduces the effect. In what follows, we look closely at such a scheme.
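The trade-off in (45) can be illustrated on a deliberately simplified scalar model. Assume, purely for illustration, a single quadratic risk $J(w)=\frac{1}{2}(w-w^o)^2$, so that the error recursion reduces to $\widetilde{w}_i = (1-\mu )\widetilde{w}_{i-1} + \mu s_i - g_i$ with independent zero-mean noises of variances $\sigma _s^2$ and $\sigma _g^2$. The mean-square error then obeys a deterministic recursion, which the hypothetical helper below iterates to steady state.

```python
def steady_state_mse(mu, privacy_noise_var, grad_noise_var=1.0, iterations=20000):
    """Iterate v_i = (1 - mu)^2 v_{i-1} + mu^2 * grad_noise_var + privacy_noise_var."""
    v = 0.0
    for _ in range(iterations):
        v = (1.0 - mu) ** 2 * v + mu ** 2 * grad_noise_var + privacy_noise_var
    return v

for mu in (0.1, 0.01, 0.001):
    without_privacy = steady_state_mse(mu, privacy_noise_var=0.0)
    with_privacy = steady_state_mse(mu, privacy_noise_var=0.01)
    print(f"mu={mu}: no privacy noise {without_privacy:.5f}, "
          f"with privacy noise {with_privacy:.5f}")
```

In this toy model the limit equals $(\mu ^2\sigma _s^2 + \sigma _g^2)/(2\mu - \mu ^2)$, so the gradient-noise contribution shrinks with $\mu$ while the privacy-noise contribution grows like $\sigma _g^2/(2\mu )$, mirroring the $O(\mu )$ and $O(\mu ^{-1})$ terms in (45).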

3.3 Graph-homomorphic perturbations

We consider a specific privatization scheme and specialize the above results. The goal of the scheme is to remove the $O(\mu ^{-1})$ term from the MSE bounds. Thus, focusing on the centroid model expression (16), we wish to cancel out the total added noise amongst servers, i.e.,

$$\begin{aligned} \sum _{p,m=1}^P a_{pm}{\varvec{g}}_{pm,i} = 0. \end{aligned}$$

(46)

To achieve this, we introduce graph-homomorphic perturbations defined as follows [6]. We assume each server p draws a sample ${\varvec{g}}_{p,i}$ independently from the Laplace distribution $Lap(0,\sigma _g/\sqrt{2})$ with variance $\sigma _{g}^2$. Server p then sets the noise ${\varvec{g}}_{mp,i}$ added to the message sent to its neighbour m as:

$$\begin{aligned} {\varvec{g}}_{mp,i} = {\left\{ \begin{array}{ll} {\varvec{g}}_{p,i}, &{} m \ne p, \\ - \frac{1-a_{pp}}{a_{pp}} {\varvec{g}}_{p,i}, &{} m = p. \end{array}\right. } \end{aligned}$$

(47)

With such a construction, condition (46) is satisfied:

$$\begin{aligned} \sum _{p,m=1}^P a_{pm}{\varvec{g}}_{pm,i}&= \sum _{p \ne m} a_{pm} {\varvec{g}}_{p,i} - \sum _{p=1}^P a_{pp} \left( \frac{1-a_{pp}}{a_{pp}} \right) {\varvec{g}}_{p,i} \\&= \sum _{p=1}^P (1-a_{pp}) {\varvec{g}}_{p,i} - (1-a_{pp}) {\varvec{g}}_{p,i} = 0. \end{aligned}$$

(48)
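The construction (47) and the cancellation (48) can be checked numerically. The sketch below is an illustration under assumptions: scalar noise, a hypothetical symmetric doubly-stochastic matrix with $a_{pp} > 0$ for every server, and Laplace samples generated as the difference of two exponential variables; the function names are not from the original text.

```python
import random

def laplace(scale, rng):
    """Zero-mean Laplace sample with the given scale (variance 2 * scale^2)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def graph_homomorphic_noise(A, sigma_g, rng):
    """Return G with G[m][p] = g_{mp,i}, the noise server p adds for server m, as in (47)."""
    P = len(A)
    g = [laplace(sigma_g / 2 ** 0.5, rng) for _ in range(P)]   # g_{p,i}, variance sigma_g^2
    G = [[0.0] * P for _ in range(P)]
    for p in range(P):
        for m in range(P):
            if m != p:
                G[m][p] = g[p]                                  # same sample to every neighbour
            else:
                G[p][p] = -(1.0 - A[p][p]) / A[p][p] * g[p]     # requires a_pp > 0
    return G

rng = random.Random(0)
A = [[0.6, 0.4, 0.0],
     [0.4, 0.2, 0.4],
     [0.0, 0.4, 0.6]]
G = graph_homomorphic_noise(A, sigma_g=1.0, rng=rng)

# Weighted total of the inter-server noise, the left-hand side of (46).
P = len(A)
total = sum(A[p][m] * G[p][m] for p in range(P) for m in range(P))
print(total)
```

The weighted total vanishes up to floating-point error even though each individual message is perturbed, which is the property that removes the inter-server noise from the centroid recursion.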

Thus, with such a scheme, the noise components proportional to $O(\mu ^{-1})$ resulting from the noise added between the servers cancel out in the error recursions. However, since gradients are evaluated at the local models ${\varvec{w}}_{p,i}$ and not at the centroid ${\varvec{w}}_{c,i}$, the effect of the noise is still evident. Yet, this remaining error introduced by the noise is controlled by the step-size. Thus, its effect can be mitigated by using a smaller step-size. In the next corollary, we show that if no noise is added amongst the clients and graph-homomorphic perturbations are used amongst servers, then the error converges to $O(1)\sigma _g^2$.

Corollary 1

(Centroid MSE convergence under graph-homomorphic perturbations) Under Assumptions 1, 2 and 3, the network centroid with graph-homomorphic perturbations converges to the optimal point $w^o$ exponentially fast for a sufficiently small step-size $\mu$:

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \lambda _c {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i-1} \Vert ^2 +\mu ^2 \sigma _s^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 \\&\quad + \frac{O(\mu )}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1}-{\varvec{w}}_{c,i-1}\Vert ^2. \end{aligned}$$

(49)

Then, letting i tend to infinity, we get:

$$\begin{aligned} \limsup _{i\rightarrow \infty } {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \frac{\mu ^2 \sigma _s^2 +O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 }{1-\lambda _c} + \sum _{p=1}^PO(1 ) \sigma _{g,p}^2 + O(\mu ). \end{aligned}$$

(50)

Proof

Starting from (43) and setting ${\mathbb {E}} \Vert {\varvec{g}}\Vert ^2 = 0$, since ${\varvec{g}}_{i} = 0$, we get the final result. $\square$

3.4 Sharing gradients as opposed to weight estimates

We next show that sharing gradients rather than models yields better performance under added noise. In the remainder of this section and for the sake of simplicity, we illustrate this conclusion by considering one federated unit, say $p=1$. If we were to introduce differential privacy to federated learning, then random Laplacian noise would be added to each model by the client before aggregation by the server, and the new privatized aggregation step would become:

$$\begin{aligned} {\varvec{w}}_{1,i} = \frac{1}{L}\sum _{k \in {\mathcal {L}}_{1,i}} \left( {\varvec{w}}_{1,k,i} + {\varvec{g}}_{1,k,i} \right) . \end{aligned}$$

(51)

However, if we were to study the MSE convergence of this privatized algorithm, we would notice a new $O(\mu ^{-1})\sigma _g^2$ term in the bound (Theorem 1). To address this degradation, we now describe an alternative implementation that shares gradients as opposed to weight estimates. Note first that the FL algorithm can be expressed in a single step taken from the server’s perspective:

$$\begin{aligned} {\varvec{w}}_{1,i} = {\varvec{w}}_{1,i-1} - \mu \frac{1}{L}\sum _{k \in {\mathcal {L}}_{1,i}} \frac{1}{E_{1,k}}\sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}). \end{aligned}$$

(52)

This suggests that instead of sharing its final model ${\varvec{w}}_{1,k,i}$, every agent could share the total update:

$$\begin{aligned} \frac{1}{E_{1,k}}\sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}). \end{aligned}$$

(53)

The server then aggregates the updates from all participating agents and updates the previous model ${\varvec{w}}_{1,i-1}$. In this case, if we were to privatize this new version of the algorithm, we would add random noise to the updates which are then scaled by the step-size:

$$\begin{aligned} {\varvec{\psi }}_{1,k,i-1}&= \frac{1}{E_{1,k}} \sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}), \end{aligned}$$

(54)

$$\begin{aligned} {\varvec{w}}_{1,i}&= {\varvec{w}}_{1,i-1} - \mu \frac{1}{L}\sum _{k\in {\mathcal {L}}_{1,i}} \left( {\varvec{\psi }}_{1,k,i-1} + {\varvec{g}}_{1,k,i} \right) . \end{aligned}$$

(55)
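The difference between the two privatized aggregation rules (51) and (54)–(55) can be illustrated over a single iteration. The sketch below is illustrative only: the updates are random stand-ins for the averaged stochastic gradients in (54), the function names are hypothetical, and the same Laplacian noise realization is reused in both rules so that only the scaling differs.

```python
import numpy as np

def aggregate_models(local_models, noise):
    """Privatized model sharing (51): the server averages noisy local models."""
    return np.mean(local_models + noise, axis=0)

def aggregate_updates(w_prev, updates, noise, mu):
    """Privatized update sharing (55): the noise enters scaled by the step-size mu."""
    return w_prev - mu * np.mean(updates + noise, axis=0)

rng = np.random.default_rng(0)
L, M, mu, sigma_g = 30, 2, 0.2, 0.1           # hypothetical values
w_prev = rng.normal(size=M)
updates = rng.normal(size=(L, M))             # stand-ins for psi_{1,k,i-1} in (54)
local_models = w_prev - mu * updates          # w_{1,k,i} implied by (52)
noise = rng.laplace(scale=sigma_g / np.sqrt(2), size=(L, M))  # variance sigma_g^2

w_clean = w_prev - mu * updates.mean(axis=0)  # non-privatized server model
err_models = aggregate_models(local_models, noise) - w_clean
err_updates = aggregate_updates(w_prev, updates, noise, mu) - w_clean

print(np.linalg.norm(err_models), np.linalg.norm(err_updates))
```

In this sketch the perturbation injected per iteration under update sharing is exactly $\mu$ times the one injected under model sharing, which is the mechanism behind the different orders in $\mu$ of the two steady-state bounds.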

We show in the following theorem the effect of the added noise on the new FL algorithm. It turns out that the noise introduces an $O(\mu )$ error instead of $O(\mu ^{-1})$.

Theorem 2

(MSE convergence of privatized FL) Under Assumptions 2 and 3, the privatized FL algorithm (54)–(55) converges exponentially fast for a small enough step-size to a neighbourhood of the optimal model:

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2&\le \lambda {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i-1}\Vert ^2 +O( \mu ^2) \sigma _{s,1}^2 + O(\mu ^2)\xi ^2 + \frac{\mu ^2}{L}\sigma _{g,1}^2 + O(\mu ^3). \end{aligned}$$

(56)

where $\lambda = \sqrt{1-2\nu \mu + (\beta _{s,1}^2+\delta ^2)\mu ^2} + O(\mu ^2) \in (0,1)$. Then, in the limit:

$$\begin{aligned} \limsup _{i \rightarrow \infty } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2 \le O(\mu ) (\sigma _{s,1}^2 + \xi ^2 + \sigma _{g,1}^2) + O(\mu ^2). \end{aligned}$$

(57)

Proof

See “Appendix 6”. $\square$

Thus, sharing the updates instead of the models is advantageous since the effect of the added noise on the performance is reduced. The $O(\mu )$ factor allows us to increase the value of the noise variance while ensuring the model utility does not deteriorate significantly. Therefore, to guarantee an $\epsilon (i)$-DP algorithm, we let the added noise be a zero-mean Laplacian random variable with variance $\sigma _{g}^2$.

4 Privacy analysis

We study the privacy of the algorithm (6)–(8) in terms of differential privacy. We focus on graph-homomorphic perturbations and show that the adopted scheme is differentially private. To do so, we first define what it means for an algorithm to be $\epsilon$-differentially private. Without loss of generality, assume agent 1 in federated unit 1 decides not to participate, and its data samples $x_{1,1}$ are replaced by a new set $x'_{1,1}$ with a different distribution. Then, with the new data, the algorithm takes a different path. We denote the new models by ${\varvec{w}}'_{p,k,i}$. The idea behind differential privacy is that an outside observer should not be able to distinguish between the two trajectories ${\varvec{w}}_{p,k,i}$ and ${\varvec{w}}'_{p,k,i}$ and conclude whether agent 1 participated in the training. More formally, differential privacy is defined below.

Definition 1

($\epsilon (i)$-Differential Privacy) We say that the algorithm given in (6)–(8) is $\epsilon (i)$-differentially private for server p at time i if the following condition holds on the joint distribution $f(\cdot )$:

$$\begin{aligned} \frac{f\left( \left\{ \left\{ {\varvec{\psi }}_{p,j} + {\varvec{g}}_{pm,j} \right\} _{m\in {\mathcal {N}}_p \setminus \{p\} } \right\} _{j=0}^i \right) }{f\left( \left\{ \left\{ {\varvec{\psi }}'_{p,j} + {\varvec{g}}_{pm,j} \right\} _{m\in {\mathcal {N}}_p \setminus \{p\} }\right\} _{j=0}^i \right) } \le e^{\epsilon (i)}. \end{aligned}$$

(58)

$\square$

Thus, the above definition states that minimally varied trajectories have comparable probabilities. In addition, the smaller the value of $\epsilon$ is, the higher the privacy guarantee will be. The goal is therefore to decrease $\epsilon$ as long as the model utility is not strongly affected.

Next, in order to show that the algorithm is differentially private, we require the sensitivity of the algorithm to be bounded. The sensitivity at time i is defined as:

$$\begin{aligned} \Delta (i)&= \Vert {\varvec{{ {\mathcal {W}}}}}_{i} - {\varvec{{ {\mathcal {W}}}}}'_{i}\Vert . \end{aligned}$$

(59)

It measures the distance between the original and perturbed weight vectors. It is shown in “Appendix 4” that $\Delta (i)$ can be bounded as follows:

$$\begin{aligned} \Delta (i) \le B+B' + \sqrt{P}\Vert w^o-w'^o\Vert , \end{aligned}$$

(60)

for constants B and $B'$ chosen by the designer. Moreover, the above bound holds with high probability given by:

$$\begin{aligned} {\mathbb {P}}(\Delta (i) \le B + B' + \sqrt{P} \Vert w^o - w'^o\Vert )&\ge \left( 1- \frac{\lambda ^i_{\max } {\mathbb {E}}\Vert {\varvec{{ {\mathcal {W}}}}}_0\Vert ^2 + O(\mu ) + O(\mu ^{-1})}{B^2} \right) \\&\quad \times \left( 1- \frac{\lambda '^i_{\max } {\mathbb {E}}\Vert {\varvec{{ {\mathcal {W}}}}}'_0\Vert ^2 + O(\mu ) + O(\mu ^{-1})}{B'^2} \right) . \end{aligned}$$

(61)

This result shows that the sensitivity can be bounded with high probability, which in turn is dependent on the values chosen for B and $B'$. Larger values for these constants increase the probability, but nevertheless lead to a looser bound for privacy (as shown in Theorem 3). Therefore, the choice of B and $B'$ needs to be balanced judiciously to ensure the desired level of privacy.

Using the bound on the sensitivity and from the definition of differential privacy, we can finally show that the algorithm is differentially private with high probability.

Theorem 3

(Privacy of GFL algorithm) If the algorithm (6)–(8) adopts graph-homomorphic perturbations, then it is $\epsilon (i)$-differentially private with high probability at time i for a standard deviation of $\sigma _g = \sqrt{2}(B+B'+\sqrt{P}\Vert w^o-w'^o\Vert )(i+1) / \epsilon (i)$.

Proof

See “Appendix 5”. $\square$

Thus, the above theorem suggests that if we wish the algorithm to be $\epsilon (i)$-differentially private, then we need to choose the noise variance accordingly. The larger the variance is, the more private the algorithm will be. However, the longer the algorithm is run, the larger the noise variance required to keep the same level of privacy guarantee. Said differently, if we fix the added noise, then as time passes, the algorithm becomes less private, and more information is leaked. However, with graph-homomorphic perturbations, we can afford to increase the variance since its effect on the MSE is constant, and thus decrease the leakage.
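The growth of the required noise level with the running time can be made concrete with a small calculation. The sketch below evaluates the standard deviation of Theorem 3 using the sensitivity bound (60); the function names and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def noise_std(epsilon, i, B, B_prime, P, opt_gap):
    """Laplacian standard deviation needed for epsilon(i)-DP over iterations 0..i.

    opt_gap stands for ||w^o - w'^o||; B and B_prime are the designer-chosen
    constants appearing in the sensitivity bound (60).
    """
    sensitivity_bound = B + B_prime + math.sqrt(P) * opt_gap
    return math.sqrt(2) * sensitivity_bound * (i + 1) / epsilon

def laplace_scale(sigma_g):
    """Scale b of a zero-mean Laplace variable with variance sigma_g**2 = 2 * b**2."""
    return sigma_g / math.sqrt(2)

# Hypothetical setting: the required sigma_g grows linearly with the iteration index.
for i in (0, 9, 99):
    sigma_g = noise_std(epsilon=1.0, i=i, B=1.0, B_prime=1.0, P=4, opt_gap=0.5)
    print(i, sigma_g, laplace_scale(sigma_g))
```

For a fixed $\epsilon (i)$, the required standard deviation scales with $i+1$, which is the reason a fixed noise level leaks more information as the algorithm runs longer.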

Moreover, we study the effect of the model drift on the privacy of the algorithm. If we examine closely the probability that the sensitivity is bounded, the model drift $\xi$ appears in the $O(\mu )$ term. The smaller the model drift is, the higher the probability that the sensitivity is bounded. This in turn implies that the algorithm is differentially private with higher probability. Furthermore, if we study the average $\epsilon (i)$, we see that:

$$\begin{aligned} {\mathbb {E}}\, \epsilon (i)&= \frac{\sqrt{2}}{\sigma _g} \sum _{j=1}^i {\mathbb {E}} \Delta (j) \\&\le \frac{\sqrt{2}}{\sigma _g} \sum _{j=1}^i \Big ( {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,j}\Vert + {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,j}'\Vert + \Vert w^o - w'^o\Vert \Big ) \\&\le \frac{\sqrt{2}}{\sigma _g} \sum _{j=1}^i \Big ( \lambda ^{j/2} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert + \frac{1}{\sqrt{1-\lambda }} \left( O(\mu )(\sigma _{s,p}^2 + \xi ^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \\&\quad + \lambda '^{j/2} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}'_{p,0}\Vert + \frac{1}{\sqrt{1-\lambda '}} \left( O(\mu )(\sigma '^2_{s,p} + \xi '^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \\&\quad + \Vert w^o - w'^o\Vert \Big ) \\&\le \frac{\sqrt{2}}{\sigma _g} \Bigg ( \frac{1-\lambda ^{i/2}}{1-\lambda ^{1/2}} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert + \frac{1-\lambda '^{i/2}}{1-\lambda '^{1/2}} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}'_{p,0}\Vert + i\Vert w^o - w'^o\Vert \\&\quad + \frac{i}{\sqrt{1-\lambda }} \left( O(\mu ) (\sigma _{s,p}^2 + \xi ^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \\&\quad + \frac{i}{\sqrt{1-\lambda '}} \left( O(\mu ) (\sigma '^2_{s,p} + \xi '^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \Bigg ) , \end{aligned}$$

(62)

so that, as the model drift decreases, the bound on $\epsilon (i)$ decreases on average. Therefore, with smaller model drift, we can achieve higher privacy with more certainty.

5 Experimental analysis

We conduct a series of experiments to study the influence of privatization on the GFL algorithm. The aim of the experiments is to show that graph-homomorphic perturbations outperform random perturbations and that perturbing gradients outperforms perturbing models, and to study the effect of different parameters on the performance of the algorithm.

5.1 Regression

We start by studying a regression problem on simulated data, which we choose for its tractability. We consider the quadratic loss, which has a closed-form solution, i.e., an explicit expression for the true model $w^o$ is known, which makes the calculation of the mean-square error feasible and more accurate.

Consider a streaming feature vector ${\varvec{u}}_{p,k,n} \in {\mathbb {R}}^M$ with output variable ${\varvec{d}}_{p,k}(n) \in {\mathbb {R}}$ given by:

$$\begin{aligned} {\varvec{d}}_{p,k}(n) = {\varvec{u}}_{p,k,n}^\textsf{T}w^{\star } + {\varvec{v}}_{p,k}(n), \end{aligned}$$

(63)

where $w^{\star }\in {\mathbb {R}}^M$ is some generating model, and ${\varvec{v}}_{p,k}(n)$ is a zero-mean Gaussian random variable with variance $\sigma _{v_{p,k}}^2$ that is independent of ${\varvec{u}}_{p,k,n}$. Then, the optimal model that solves the following problem:

$$\begin{aligned} \min _w \frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K \frac{1}{N_{p,k}}\sum _{n=1}^{N_{p,k}} \Vert {\varvec{d}}_{p,k}(n) - {\varvec{u}}_{p,k,n}^\textsf{T}w\Vert ^2 + \rho \Vert w\Vert ^2 \end{aligned}$$

(64)

is found to be:

$$\begin{aligned} w^o = ({\widehat{R}}_u + \rho I)^{-1}( {\widehat{R}}_u w^{\star } + {\widehat{r}}_{uv}), \end{aligned}$$

(65)

where ${\widehat{R}}_u$ and ${\widehat{r}}_{uv}$ are defined as:

$$\begin{aligned} {\widehat{R}}_u&\,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K \frac{1}{N_{p,k}} \sum _{n=1}^{N_{p,k}} {\varvec{u}}_{p,k,n}{\varvec{u}}_{p,k,n}^\textsf{T}, \end{aligned}$$

(66)

$$\begin{aligned} {\widehat{r}}_{uv}&\,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K \frac{1}{N_{p,k}} \sum _{n=1}^{N_{p,k}} {\varvec{v}}_{p,k}(n){\varvec{u}}_{p,k,n}. \end{aligned}$$

(67)
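The closed-form expression (65) can be evaluated directly on simulated data. The following sketch assumes that all agents across all units are stored in flat lists and weighted equally, as in (64); the function names and the data-generation choices are hypothetical. It also checks that the gradient of the regularized risk (64) vanishes at the computed $w^o$.

```python
import numpy as np

def empirical_moments(features, noises):
    """Return (R_u, r_uv) as in (66)-(67), weighting every agent equally.

    features : list of (N_k, M) arrays, one per agent (all units flattened)
    noises   : list of (N_k,) arrays holding the noise samples v_{p,k}(n)
    """
    R_u = np.mean([U.T @ U / U.shape[0] for U in features], axis=0)
    r_uv = np.mean([U.T @ v / U.shape[0] for U, v in zip(features, noises)], axis=0)
    return R_u, r_uv

def optimal_model(features, noises, w_star, rho):
    """Closed-form minimizer (65) of the regularized least-squares risk (64)."""
    R_u, r_uv = empirical_moments(features, noises)
    return np.linalg.solve(R_u + rho * np.eye(R_u.shape[0]), R_u @ w_star + r_uv)

def risk_gradient(w, features, outputs, rho):
    """Gradient of the objective in (64) with respect to w."""
    grads = [-2.0 * U.T @ (d - U @ w) / U.shape[0] for U, d in zip(features, outputs)]
    return np.mean(grads, axis=0) + 2.0 * rho * w

# Hypothetical non-iid data: each agent has its own feature scale and noise level.
rng = np.random.default_rng(0)
M, rho, num_agents, N = 2, 0.1, 20, 100
w_star = rng.normal(size=M)
features = [rng.uniform(0.5, 2.0) * rng.normal(size=(N, M)) for _ in range(num_agents)]
noises = [rng.normal(scale=rng.uniform(0.1, 0.5), size=N) for _ in range(num_agents)]
outputs = [U @ w_star + v for U, v in zip(features, noises)]  # model (63)

w_opt = optimal_model(features, noises, w_star, rho)
print(w_opt, np.linalg.norm(risk_gradient(w_opt, features, outputs, rho)))
```

Because of the regularization term and the finite-sample noise correlation ${\widehat{r}}_{uv}$, the minimizer $w^o$ generally differs from the generating model $w^{\star }$.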

We consider $P = 10$ units, each with $K= 100$ total agents. We assume $N_{p,k}=100$ for each agent. We randomly generate two-dimensional feature vectors ${\varvec{u}}_{p,k,n}$ from a zero-mean Gaussian distribution with a randomly generated covariance matrix $R_{u_{p,k}}$. We then calculate the corresponding outputs according to (63). To make the data non-iid across agents, we let the covariance matrix $R_{u_{p,k}}$ differ across agents, as well as the variance $\sigma _{v_{p,k}}^2$ of the added noise. When running the algorithm, we assume each unit samples at random $L = 11$ agents, and each agent runs $E_{p,k} \in [1,10]$ epochs and uses a mini-batch of $B_{p,k} \in [5,10]$ samples.

We compare three algorithms: the standard GFL algorithm, the privatized GFL algorithm with random perturbations, and the privatized GFL with homomorphic perturbations. We do not add noise between the clients and their server to focus on the effect of the perturbations between the servers. In the first set of simulations, we fix the step-size $\mu =0.7$ and the regularization parameter $\rho = 0.1$. We fix the variance of the added noise for privatization in both schemes to $\sigma _g^2 = 0.1$. We then plot the mean-square-deviation (MSD) at each time step for the centroid model:

$$\begin{aligned} \text{ MSD}_i \,\overset{\Delta }{=}\,\Vert {\varvec{w}}_{c,i} - w^o\Vert ^2, \end{aligned}$$

(68)

as seen in Fig. 2. We observe that the privatized GFL with random perturbations has lower performance compared to the other two algorithms, whereas using homomorphic perturbations does not result in such a decay in performance. Thus, our suggested scheme does a good job at tracking the performance of the original GFL algorithm without compromising the privacy level.

We next study the extent of the effect of the noise on the model utility. We run a series of experiments with varying added noise $\sigma _g^2 \in \{0.001, 0.01, 0.1,1,2,10\}$ for the two privatized GFL algorithms. We plot the resulting MSD curves in Fig. 3a. We observe that, for a fixed step-size, as we increase the variance, the MSD of the algorithm with random perturbations increases significantly, as opposed to the algorithm with homomorphic perturbations. Thus, we conclude that the algorithm with random perturbations is more sensitive to the variance of the added noise. In fact, beyond some variance, the algorithm with random perturbations breaks down, whereas graph-homomorphic perturbations delay that effect to a much larger variance. In addition, as long as the step-size is small enough, we can always control the effect of the graph-homomorphic perturbations.

However, if we were to look at the individual MSD for one federated unit, we would discover that the performance of the algorithm decays as the noise variance is increased. Nonetheless, it is not to the extent of random perturbations. We plot in Fig. 3b the average individual MSD for the varying noise variance:

$$\begin{aligned} \text{ MSD}_{\text{ avg },i} \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P \Vert {\varvec{w}}_{p,i} - w^o\Vert ^2. \end{aligned}$$

(69)
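The two performance measures (68) and (69) can behave quite differently, as a small example shows. The sketch below assumes the centroid is the plain average of the server models; the function names and the toy values are hypothetical.

```python
import numpy as np

def centroid_msd(W, w_opt):
    """MSD of the centroid model, as in (68); W holds one server model per row."""
    w_c = W.mean(axis=0)  # centroid taken as the plain average (assumption)
    return float(np.sum((w_c - w_opt) ** 2))

def average_msd(W, w_opt):
    """Average individual MSD across the P servers, as in (69)."""
    return float(np.mean(np.sum((W - w_opt) ** 2, axis=1)))

# Two servers whose errors cancel at the centroid but not individually.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
w_opt = np.zeros(2)
print(centroid_msd(W, w_opt), average_msd(W, w_opt))  # prints 0.0 1.0
```

By Jensen's inequality, the centroid MSD never exceeds the average individual MSD, so perturbations that cancel at the centroid can still appear in the individual models.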

We observe that for a fixed noise variance, homomorphic perturbations result in a better performance. Furthermore, as we increase the noise variance, the network disagreement increases for both schemes. This comes as no surprise and is in accordance with Lemma 3. As previously mentioned, graph-homomorphic perturbations have the added value of not being negatively affected by the decrease in the step-size. In addition, even though the improvement does not seem significant, the source of the error of the two schemes is different. With graph-homomorphic perturbations, the information of the true model is distributed in the network and can be retrieved by running a consensus-type step at the end of the learning algorithm. At that point, the local models no longer contain information about the local data, and thus agents can safely share their models. However, when random perturbations are used, reconstruction is not possible since the information has been lost in the network due to the added perturbations.

We next fix the noise variance $\sigma _{g}^2 = 0.1$ and vary the step-size $\mu \in \{0.1, 0.5, 1, 5 \}$. According to Theorem 4, the MSD resulting from random perturbations includes an $O(\mu ^{-1})$ term, which is not the case when using graph-homomorphic perturbations. Thus, we expect that a decrease in the step-size will not significantly affect the privatized algorithm with graph-homomorphic perturbations, as opposed to random perturbations. Indeed, as seen in Fig. 4, as $\mu$ is increased, the final MSD with graph-homomorphic perturbations increases; this is probably due to the $O(\mu )\sigma _s^2$ term in the bound. In contrast, for very small or very large $\mu$, the performance of the privatized algorithm with random perturbations decreases. In addition, we observe for both privacy schemes that the rate of convergence slows down as we decrease the step-size. Thus, there exists an optimal step-size that achieves a good compromise between a fast convergence and a low MSD.

5.2 Privatized federated learning

We focus on the single-server FL setting (i.e., $P = 1$), with $K=1000$ agents of which we choose $L=30$ at a time. We generate non-iid datasets of varying size for each agent as in the previous section. We allow each agent to run a varying number of epochs $E_k \in [1,10]$ during an iteration of the algorithm. We set the step-size $\mu = 0.2$, $\rho = 0.007$ and $\sigma _g^2 = 0.02$. We compare three algorithms: the standard FL algorithm, the privatized FL algorithm with sharing of models, and the privatized FL algorithm with sharing of updates. We plot the average MSD curves after repeating the experiment 100 times. As expected, the effect of the added noise is worse when models are shared (yellow curve in Fig. 5) than when updates are shared (red curve in Fig. 5).

We next study the effect of the step-size on the MSD of the privatized FL algorithm. We expect that as $\mu$ is increased, the MSD increases for the FL algorithm when updates are shared. When models are shared, however, since the gradient noise variance is scaled by $\mu$ and the added noise variance by $\mu ^{-1}$, we expect to observe a trade-off. On one hand, as $\mu$ is increased, the effect of the gradient noise is increased while that of the added noise is diminished. On the other hand, as $\mu$ is decreased, the effect of the added noise overpowers that of the gradient noise. Indeed, we observe this phenomenon in (a) and (b) of Fig. 6.
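The step-size trade-off for model sharing can be illustrated with a stylized bound of the form $c_s \mu \sigma _s^2 + c_g \mu ^{-1} \sigma _g^2$. The sketch below is only illustrative: the constants `c_s` and `c_g` are hypothetical placeholders for the unknown factors hidden in the $O(\cdot )$ terms, and the function names are not from the paper.

```python
import math

def model_sharing_bound(mu, sigma_s2, sigma_g2, c_s=1.0, c_g=1.0):
    """Stylized steady-state bound: gradient noise scales with mu, added noise with 1/mu."""
    return c_s * mu * sigma_s2 + c_g * sigma_g2 / mu

def balancing_step_size(sigma_s2, sigma_g2, c_s=1.0, c_g=1.0):
    """Step-size minimizing the stylized bound, where both terms are equal."""
    return math.sqrt(c_g * sigma_g2 / (c_s * sigma_s2))

sigma_s2, sigma_g2 = 1.0, 0.04  # hypothetical variances
mu_star = balancing_step_size(sigma_s2, sigma_g2)
for mu in (0.5 * mu_star, mu_star, 2.0 * mu_star):
    print(mu, model_sharing_bound(mu, sigma_s2, sigma_g2))
```

Under these assumptions the bound is smallest at the balancing step-size and grows when $\mu$ is either halved or doubled, mirroring the qualitative trade-off described for model sharing.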

Finally, we study the effect of the variance of the added noise. We fix the step-size at $\mu =0.2$ and vary the noise variance $\sigma _g^2 \in \{0.01,0.05, 0.1,0.5\}$. In both cases, as we increase $\sigma _g^2$, the performance diminishes ((c), (d) of Fig. 6). However, the larger values of the added noise variance affect the perturbed models more than the perturbed gradients. The algorithm diverges for lower values of $\sigma _g^2$ when models are shared than when gradients are shared. Thus, sharing updates can handle larger values of $\sigma _g^2$ before the algorithm diverges. In addition, since the variance is scaled by the step-size, we can always find a suitable $\mu$ to decrease its effect.

5.3 Classification

We now focus on a classification problem applied to a dataset on click-rate prediction of ads. We consider the Avazu click-through dataset [29]. We split the 5101 data samples unequally among a total of 50 agents. We assume there are $P = 5$ units, each with $K = 10$ agents. We add non-iid noise to the data at each agent to change their distributions. We again compare three algorithms: standard GFL, privatized GFL with homomorphic perturbations, and privatized GFL with random perturbations. We use a regularized logistic risk with regularization parameter $\rho = 0.03$. We set the step-size $\mu = 0.5$. We repeat the algorithms for multiple levels of privacy. We then settle on a noise variance $\sigma _g^2 = 0.6$ for which the privatized algorithm with random perturbations still converges. We plot in Fig. 7 the testing error on a set of 256 clean samples that were not perturbed with noise to change their distributions. We use the centroid model learned during each iteration to calculate the corresponding testing error. We observe that the graph-homomorphic perturbations do not hinder the performance of the privatized model. As for random perturbations, they significantly reduce the utility of the learnt model.

6 Conclusion

In this work, we introduced graph federated learning and implemented an algorithm that guarantees privacy of the data in a differential privacy sense. We showed that general privatization based on adding random perturbations to updates in federated learning has a negative effect on the performance of the algorithm. Random perturbations drive the algorithm farther away from the true optimal model. However, we showed that by adding graph-homomorphic perturbations, which exploit the graph structure, performance can be recovered with guaranteed privacy. We also showed that using dependent perturbations does not result in the same trade-off between privacy and efficiency. In federated learning, we proved that sharing perturbed gradients rather than perturbed models significantly reduces the effect of the added noise on the model utility. Thus, we no longer have to choose what to prioritize, and instead, we can have both a highly privatized algorithm and a good model utility.

Availability of data and materials

The generated data is not available since it is randomly generated every time the experiment is run. The Avazu dataset used as a real-world example can be found at http://www.csie.ntu.edu.tw/-cj1in/libsvmtools/.

Abbreviations

GFL: Graph Federated Learning
FL: Federated Learning
FedAvg: Federated Averaging
SGD: Stochastic Gradient Descent
SMC: Secure Multiparty Computation
MSE: Mean-square-error
MSD: Mean-square-deviation

References

H.B. McMahan, E. Moore, D. Ramage, S. Hampson, Communication-efficient learning of deep networks from decentralized data, in Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 54 (2017), pp. 1273–1282
B. Hitaj, G. Ateniese, F. Perez-Cruz, Deep models under the GAN: information leakage from collaborative deep learning, in Proceedings of ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA (2017), pp. 603–618
L. Melis, C. Song, E. De Cristofaro, V. Shmatikov, Exploiting unintended feature leakage in collaborative learning, in IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA (2019), pp. 691–706
M. Nasr, R. Shokri, A. Houmansadr, Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning, in IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA (2019), pp. 739–753
L. Zhu, S. Han, Deep leakage from gradients, in Advances in Neural Information Processing Systems, Vancouver, Canada (2019), pp. 17–31
S. Vlaski, A.H. Sayed, Graph-homomorphic perturbations for private decentralized learning, in Proceedings of the ICASSP, Toronto, Canada (2021), pp. 1–5
L. Liu, J. Zhang, S.H. Song, K.B. Letaief, Client-edge-cloud hierarchical federated learning, in IEEE International Conference on Communications (ICC) (2020), pp. 1–6
E. Rizk, A.H. Sayed, A graph federated architecture with privacy preserving learning, in IEEE International Workshop on Signal Processing Advances in Wireless Communications, Lucca, Italy (2021), pp. 1–5. arxiv:2104.13215
W. Liu, L. Chen, W. Zhang, Decentralized federated learning: balancing communication and computing costs. IEEE Trans. Signal Inf. Process. Over Netw. 8, 131–143 (2022)
B. Wang, J. Fang, H. Li, X. Yuan, Q. Ling, Confederated learning: federated learning with decentralized edge servers. arXiv:2205.14905 (2022)
R.C. Geyer, T. Klein, M. Nabi, Differentially private federated learning: a client level perspective. arXiv:1712.07557 (2017)
R. Hu, Y. Guo, H. Li, Q. Pei, Y. Gong, Personalized federated learning with differential privacy. IEEE Internet Things J. 7(10), 9530–9539 (2020)
A. Triastcyn, B. Faltings, Federated learning with Bayesian differential privacy, in IEEE International Conference on Big Data, Los Angeles, California, USA (2019), pp. 2587–2596
S. Truex, L. Liu, K.-H. Chow, M.E. Gursoy, W. Wei, LDP-FED: federated learning with local differential privacy, in Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking (2020), pp. 61–66
K. Wei, J. Li, M. Ding, C. Ma, H.H. Yang, F. Farokhi, S. Jin, T.Q. Quek, H.V. Poor, Federated learning with differential privacy: algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 15, 3454–3469 (2020)
B. Jayaraman, L. Wang, D. Evans, Q. Gu, Distributed learning without distress: privacy-preserving empirical risk minimization, in Advances in Neural Information Processing Systems, vol. 31. Montreal, Canada (2018)
C. Li, P. Zhou, L. Xiong, Q. Wang, T. Wang, Differentially private distributed online learning. IEEE Trans. Knowl. Data Eng. 30(8), 1440–1453 (2018)
J. Zhu, C. Xu, J. Guan, D.O. Wu, Differentially private distributed online algorithms over time-varying directed networks. IEEE Trans. Signal Inf. Process. Over Netw. 4(1), 4–17 (2018)
M.A. Pathak, S. Rane, B. Raj, Multiparty differential privacy via aggregation of locally trained classifiers, in Advances in Neural Information Processing Systems, Vancouver, Canada (2010), pp. 1876–1884
S. Gade, N.H. Vaidya, Private learning on networks. arXiv:1612.05236 (2016)
K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H.B. McMahan, S. Patel, D. Ramage, A. Segal, K. Seth, Practical secure aggregation for privacy-preserving machine learning, in Proceedings of ACM SIGSAC Conference on Computer and Communications Security, New York, USA (2017), pp. 1175–1191
A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, D. Evans, Privacy-preserving distributed linear regression on high-dimensional data. Proc. Priv. Enhanc. Technol. 2017(4), 345–364 (2017)
P. Mohassel, Y. Zhang, SecureML: a system for scalable privacy-preserving machine learning, in IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA (2017), pp. 19–38
V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, N. Taft, Privacy-preserving ridge regression on hundreds of millions of records, in IEEE Symposium on Security and Privacy, Berkeley, CA, USA (2013), pp. 334–348
W. Zheng, R.A. Popa, J.E. Gonzalez, I. Stoica, Helen: Maliciously secure coopetitive learning for linear models, in IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA (2019), pp. 724–738
Y. Ishai, E. Kushilevitz, R. Ostrovsky, A. Sahai, Cryptography with constant computational overhead, in Proceedings Annual ACM Symposium on Theory of Computing, Victoria British Columbia Canada (2008), pp. 433–442
I. Damgård, Y. Ishai, M. Krøigaard, Perfectly secure multiparty computation and the computational overhead of cryptography, in Annual International Conference on the Theory and Applications of Cryptographic Techniques, France (2010), pp. 445–465
E. Rizk, S. Vlaski, A.H. Sayed, Federated learning under importance sampling (2020). arXiv:2012.07383
K. Avazu, Avazu’s Click-Through Rate Prediction (2014). http://www.csie.ntu.edu.tw/-cj1in/libsvmtools/

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

School of Engineering, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Elsa Rizk & Ali H. Sayed
Department of Electrical and Electronic Engineering, Imperial College London, London, UK
Stefan Vlaski

Authors

Elsa Rizk
Stefan Vlaski
Ali H. Sayed

Contributions

ER is the first author and the corresponding author of this paper. Her main contributions include the mathematical analysis, derivation, simulations and writing. SV is the second author, and his contribution is conceptualization and reviewing. AHS is the third author, and his contribution is conceptualization, reviewing and supervision. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Elsa Rizk.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Secondary result on individual MSE performance

We first introduce the following theorem, which will be used to bound the network disagreement. We loosely bound the individual MSE for each federated unit. A tighter bound can be found; however, it is not needed.

Theorem 4

(Individual MSE convergence) Under Assumptions 1, 2 and 3, the individual models converge to the optimal model $w^o$ exponentially fast for a sufficiently small step-size:

$$\begin{aligned}&\text{ col }\{{\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,i} \Vert ^2 \}_{p=1}^P \\&\quad \preceq \Lambda ^i \text{ col }\{{\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0} \Vert ^2 \}_{p=1}^P + \sum _{j=0}^i \Lambda ^j \text{ col }\{\mu ^2\sigma _{s,p}^2+ O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2 \}_{p=1}^P , \end{aligned}$$

(70)

where $\preceq$ denotes elementwise comparison, $\Lambda$ is a diagonal matrix with the $p^{th}$ entry given by $\lambda _p = \sqrt{1-2\nu \mu + \delta ^2\mu ^2} + \beta _{s,p}^2\mu ^2 + O(\mu ^2)\in (0,1)$, $\sigma _{q,p}^2$ is the average of $\sigma _{q,p,k}^2$, and $\sigma _{g,p}^2$ is the total variance introduced by the noise added at server p. Then, taking the limit as i tends to infinity:

$$\begin{aligned}&\limsup _{i\rightarrow \infty } \text{ col }\left\{ {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,i} \Vert ^2 \right\} _{p=1}^P \\&\quad \preceq (I - \Lambda )^{-1} \text{ col }\left\{ \mu ^2\sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2 \right\} _{p=1}^P. \end{aligned}$$

(71)

Proof

Focusing on the error of a single server p, we can verify that:

$$\begin{aligned}&{\mathbb {E}} \{\Vert {\widetilde{{\varvec{w}}}}_{p,i}\Vert ^2 | {\mathcal {F}}_{i-1} \} \\&{\mathop {=}\limits ^{(a)}} {\mathbb {E}} \Bigg \{\bigg \Vert \sum _{m \in {\mathcal {N}}_p} a_{pm} \big ( {\widetilde{{\varvec{w}}}}_{m,i-1} + \mu \nabla _{w^{\textsf{T}}} J_m({\varvec{w}}_{m,i-1}) + \mu {\varvec{q}}_{m,i} \big ) \bigg \Vert ^2 \bigg |{\mathcal {F}}_{i-1} \Bigg \} \\&\quad + \mu ^2 {\mathbb {E}}\left\{ \bigg \Vert \sum _{m\in {\mathcal {N}}_p} a_{pm} {\varvec{s}}_{m,i} \bigg \Vert ^2 \bigg | {\mathcal {F}}_{i-1} \right\} +{\mathbb {E}} \left\{ \bigg \Vert \sum _{m \in {\mathcal {N}}_p} \frac{a_{pm}}{L} \sum _{k\in {\mathcal {L}}_{m,i}} {\varvec{g}}_{m,k,i} \bigg \Vert ^2 \bigg | {\mathcal {F}}_{i-1} \right\} \\&\quad + {\mathbb {E}} \left\{ \bigg \Vert \sum _{m\in {\mathcal {N}}_p} a_{pm}{\varvec{g}}_{pm,i} \bigg \Vert ^2 \bigg | {\mathcal {F}}_{i-1} \right\} , \\&{\mathop {\le }\limits ^{(b)}} \sum _{m\in {\mathcal {N}}_p} a_{pm} \Bigg (\frac{1}{\alpha } \Vert {\widetilde{{\varvec{w}}}}_{m,i-1} + \mu \nabla _{w^{\textsf{T}}} J_m({\varvec{w}}_{m,i-1})\Vert ^2 + \frac{\mu ^2}{1-\alpha }\big (O(\mu )\Vert {\widetilde{{\varvec{w}}}}_{m,i-1}\Vert ^2 \\&\quad + O(\mu ) \xi ^2 + O(\mu ^2) \sigma _{q,m}^2 \big ) + \mu ^2 \left( \sigma _{s,m}^2+\beta _{s,m}^2\Vert {\widetilde{{\varvec{w}}}}_{m,i-1}\Vert ^2 \right) + \frac{1}{LK}\sum _{k=1}^K {\mathbb {E}}\Vert {\varvec{g}}_{m,k,i}\Vert ^2 \\ {}&\quad + {\mathbb {E}}\Vert {\varvec{g}}_{pm,i}\Vert ^2 \Bigg ), \\&{\mathop {\le }\limits ^{(c)}} \sum _{m\in {\mathcal {N}}_p}a_{pm} \Bigg ( \bigg ( \frac{1-2\nu \mu + \delta ^2 \mu ^2 }{\alpha } + \beta _{s,m}^2\mu ^2 + \frac{O(\mu ^3)}{1-\alpha } \bigg ) \Vert {\widetilde{{\varvec{w}}}}_{m,i-1}\Vert ^2 + \mu ^2 \sigma _{s,m}^2 \\&\quad + \frac{O(\mu ^3)\xi ^2 + O(\mu ^4)\sigma _{q,m}^2}{1-\alpha } + \frac{1}{LK}\sum _{k=1}^K {\mathbb {E}}\Vert {\varvec{g}}_{m,k,i}\Vert ^2 + {\mathbb {E}}\Vert {\varvec{g}}_{pm,i}\Vert ^2 \Bigg ), \end{aligned}$$

(72)

where we define $\sigma _{q,m}^2$ to be the average of $\sigma _{q,m,k}^2$. Step (a) follows from the independence of the random variables and the zero mean of the gradient noise and the added noise, (b) from Jensen’s inequality and the bounds on the gradient noise (24) and the incremental noise (37), and (c) from $\nu$-strong convexity and $\delta$-Lipschitz continuity. Then, choosing $\alpha = \sqrt{1-2\nu \mu + \delta ^2 \mu ^2} = 1-O(\mu )$ and taking the expectation over the filtration, we get:

$$\begin{aligned} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i}\Vert ^2&\le \sum _{m\in {\mathcal {N}}_p}a_{pm}\left( \lambda _m {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{m,i-1}\Vert ^2 + \mu ^2\sigma _{s,m}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,m}^2 + \sigma _{g,m}^2 \right) , \end{aligned}$$

(73)

where we introduce the constants $\lambda _m$ and $\sigma _{g,m}^2$:

$$\begin{aligned} \lambda _m \,\overset{\Delta }{=}\,\sqrt{1-2\nu \mu +\delta ^2\mu ^2} + \beta _{s,m}^2 \mu ^2 + O(\mu ^2). \end{aligned}$$

(74)

Next, taking the column vector of every local mean-square-error, we get the following bound in which we drop the indexing from the column vectors:

$$\begin{aligned}&\text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i}\Vert ^2\right\} \\&\preceq \Lambda A \, \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2\right\} + A \, \text{ col }\left\{ \mu ^2\sigma _{s,p}^2 + \sigma _{g,p}^2+ O(\mu ^2)\xi ^2\right\} + A\,\text{ col }\left\{ O(\mu ^3)\sigma _{q,p}^2 \right\} , \\&\preceq \Lambda ^i A^i \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} + \sum _{j=0}^i \Lambda ^j A^j \text{ col }\left\{ \mu ^2\sigma _{s,p}^2 + \sigma _{g,p}^2 \right\} \\ {}&\quad + \Lambda ^jA^j\text{ col }\left\{ O(\mu ^2)\xi ^2+O(\mu ^3)\sigma _{q,p}^2 \right\} , \\&\preceq \Lambda ^i \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} + \sum _{j=0}^i \Lambda ^j \text{ col }\left\{ \mu ^2\sigma _{s,p}^2 + \sigma _{g,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2\right\} , \end{aligned}$$

(75)

where we define the diagonal matrix $\Lambda$ with $\lambda _p$ as its diagonal entries. Choosing $\mu$ small enough that $\lambda _p < 1$ for every p, $\Lambda ^i$ converges to zero as i goes to infinity. Furthermore, since every eigenvalue of $\Lambda$ is then smaller than 1, the geometric series $\sum _{j} \Lambda ^j$ converges to $(I - \Lambda )^{-1}$. Thus, we get the desired result. $\square$
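The role of the combination matrix in this argument can be checked numerically. The following is a minimal sketch, not the authors' code: it iterates the element-wise recursion suggested by (73), $x_i = A(\Lambda x_{i-1} + c)$, and compares the result with the scalar worst-case level $c_{\max }/(1-\lambda _{\max })$. The matrix `A`, the factors `lam`, the constants `c` and the helper `step` are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only: A, lam, c and x are made-up values, not from the paper.

def step(A, lam, c, x):
    """One step of x_i = A (Lambda x_{i-1} + c), with Lambda = diag(lam)."""
    n = len(x)
    inner = [lam[m] * x[m] + c[m] for m in range(n)]
    return [sum(A[p][m] * inner[m] for m in range(n)) for p in range(n)]

# Symmetric doubly-stochastic combination matrix for P = 3 servers.
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
lam = [0.90, 0.95, 0.80]   # contraction factors lambda_p < 1
c = [0.01, 0.02, 0.03]     # per-server driving terms
x = [1.0, 2.0, 3.0]        # initial mean-square errors

for _ in range(500):
    x = step(A, lam, c, x)

# Worst-case scalar level: c_max / (1 - lambda_max).
bound = max(c) / (1.0 - max(lam))
print(x, bound)
```

Because each row of `A` forms a convex combination, the largest entry of `x` contracts at least as fast as the scalar recursion with $\lambda _{\max }$, which is one way to see why the combination step does not enlarge the steady-state level.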

Appendix 2: Proof of Lemma 3

Consider the aggregate model vector, i.e., ${\varvec{{ {\mathcal {W}}}}}_{i} \,\overset{\Delta }{=}\,\text{ col }\left\{ {\varvec{w}}_{p,i}\right\} _{p=1}^P$, for which we write the model recursion as:

$$\begin{aligned} {\varvec{{ {\mathcal {W}}}}}_i&= (A \otimes I)^\textsf{T}\Bigg ({\varvec{{ {\mathcal {W}}}}}_{i-1} - \mu \text{ col }\left\{ \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + {\varvec{s}}_{p,i} + {\varvec{q}}_{p,i}\right\} \\&\quad \left. + \text{ col }\left\{ \frac{1}{L}\sum _{k\in {\mathcal {L}}_{p,i} } {\varvec{g}}_{p,k,i}\right\} \right) + \text {diag}\big ( (A \otimes I)^\textsf{T}{\varvec{{\mathcal {G}}}}_i \big ), \end{aligned}$$

(76)

where ${\varvec{{\mathcal {G}}}}_i$ is a matrix whose entries are the noise ${\varvec{g}}_{pm,i}$, and the diag$(\cdot )$ function extracts the diagonal entries of a matrix and transforms them into a column vector.

Since A is doubly-stochastic, it admits an eigendecomposition of the form $A = QH Q^\textsf{T}$, with the first eigenvalue equal to 1 and its corresponding eigenvector equal to $\mathbbm {1}/\sqrt{P}$.

Next, we define the extended centroid model ${\varvec{{ {\mathcal {W}}}}}_{c,i} \,\overset{\Delta }{=}\,\left( \frac{1}{P} \mathbbm {1}\mathbbm {1}^\textsf{T}\otimes I\right) {\varvec{{ {\mathcal {W}}}}}_{i}$, and write:

$$\begin{aligned} {\varvec{{ {\mathcal {W}}}}}_i - {\varvec{{ {\mathcal {W}}}}}_{c,i}&= \left( I - \frac{1}{P}\mathbbm {1}\mathbbm {1}^\textsf{T}\otimes I \right) {\varvec{{ {\mathcal {W}}}}}_{i} \\&= \left( (Q^{\textsf{T}}\otimes I ) (Q\otimes I ) - \frac{1}{P}\mathbbm {1}\mathbbm {1}^\textsf{T}\otimes I \right) {\varvec{{ {\mathcal {W}}}}}_{i} \\&= (Q_{\epsilon }^\textsf{T}\otimes I)(Q_{\epsilon } \otimes I) {\varvec{{ {\mathcal {W}}}}}_i \\&= (Q_{\epsilon }^\textsf{T}\otimes I)H_{\epsilon } (Q_{\epsilon } \otimes I) \Bigg ( {\varvec{{ {\mathcal {W}}}}}_{i-1} -\mu \text{ col }\left\{ \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + {\varvec{s}}_{p,i} + {\varvec{q}}_{p,i}\right\} \\&\quad \left. + (Q_{\epsilon }^\textsf{T}\otimes I )(Q_{\epsilon } \otimes I)\text {diag} \left( (A\otimes I)^\textsf{T}{\varvec{{\mathcal {G}}}}_i \right) + \text{ col }\left\{ \frac{1}{L} \sum _{k \in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i}\right\} \right) . \end{aligned}$$

(77)
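The contraction of the disagreement around the centroid can be illustrated on a toy network. The sketch below assumes scalar models and a hypothetical symmetric doubly-stochastic matrix `A` with eigenvalues 1, 0.25 and 0.25; the helpers `combine` and `disagreement` are introduced here for illustration only. One combination step leaves the centroid unchanged and shrinks the deviation from it by the second eigenvalue.

```python
import math

# Illustrative values only: this 3-server matrix is symmetric and
# doubly-stochastic, with eigenvalues 1, 0.25 and 0.25.
A = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

def combine(A, w):
    """One combination step w <- A w for scalar models."""
    return [sum(a * v for a, v in zip(row, w)) for row in A]

def disagreement(w):
    """Euclidean distance between the models and their centroid."""
    centroid = sum(w) / len(w)
    return math.sqrt(sum((v - centroid) ** 2 for v in w))

w = [1.0, -2.0, 4.0]
w_next = combine(A, w)
ratio = disagreement(w_next) / disagreement(w)
print(ratio)
```

In this toy case both non-unit eigenvalues coincide, so the ratio equals the second eigenvalue; for a general matrix the second eigenvalue only gives an upper bound on the per-step contraction.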

Then, taking the conditional expectation of $\Vert (Q_{\epsilon } \otimes I){\varvec{{ {\mathcal {W}}}}}_{i}\Vert ^2$ given the past models, we can split the gradient noise and the added privacy noise from the model and the true gradient. Taking the expectation again over the past data, and then using the sub-multiplicative property of the norm followed by Jensen’s inequality, we have:

$$\begin{aligned}&{\mathbb {E}} \Vert (Q_{\epsilon } \otimes I) {\varvec{{ {\mathcal {W}}}}}_{i}\Vert ^2 \\&\quad \le \Vert H_{\epsilon }\Vert ^2\Bigg ( {\mathbb {E}} \left\| (Q_{\epsilon } \otimes I) {\varvec{{ {\mathcal {W}}}}}_{i-1} - (Q_{\epsilon } \otimes I) \mu \text{ col }\left\{ \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + {\varvec{q}}_{p,i} \right\} \right\| ^2 \\&\qquad + \mu ^2\Vert Q_{\epsilon } \otimes I\Vert ^2 \sum _{p=1}^P{\mathbb {E}} \Vert {\varvec{s}}_{p,i}\Vert ^2 \left. + \Vert Q_{\epsilon } \otimes I \Vert ^2\sum _{p=1}^P {\mathbb {E}} \left\| \frac{1}{L}\sum _{k\in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i}\right\| ^2 \right) \\&\qquad + \Vert Q_{\epsilon } \otimes I \Vert ^2 {\mathbb {E}} \Vert \text {diag}\left( (A\otimes I)^\textsf{T}{\varvec{{\mathcal {G}}}}_i \right) \Vert ^2 \\&\quad \le \Vert H_{\epsilon }\Vert ^2 \Bigg ( \frac{1}{\Vert H_{\epsilon }\Vert } {\mathbb {E}} \Vert (Q_{\epsilon } \otimes I){\varvec{{ {\mathcal {W}}}}}_{i-1}\Vert ^2 + \frac{\mu ^2 \Vert Q_{\epsilon } \otimes I \Vert ^2}{1-\Vert H_{\epsilon }\Vert } \sum _{p=1}^P {\mathbb {E}} \Vert \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + {\varvec{q}}_{p,i}\Vert ^2 \\&\qquad + \mu ^2 \Vert Q_{\epsilon } \otimes I \Vert ^2 \sum _{p=1}^P {\mathbb {E}} \Vert {\varvec{s}}_{p,i}\Vert ^2 \left. + \Vert Q_{\epsilon } \otimes I \Vert ^2 \sum _{p=1}^P {\mathbb {E}} \left\| \frac{1}{L}\sum _{k\in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i} \right\| ^2 \right) \\&\qquad +\Vert Q_{\epsilon } \otimes I\Vert ^2 {\mathbb {E}}\Vert \text {diag} \left( (A\otimes I)^\textsf{T}{\varvec{{\mathcal {G}}}}_i \right) \Vert ^2 . \end{aligned}$$

(78)

Next, we focus on each individual term. Using Jensen’s inequality with some constant $\alpha \in (0,1)$, followed by the Lipschitz condition and the bound on the incremental noise, we can bound the norm below as follows:

$$\begin{aligned} {\mathbb {E}} \Vert \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + {\varvec{q}}_{p,i}\Vert ^2&\le \frac{2}{\alpha } \left( \delta ^2 {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + \Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2 \right) \\&\quad + \frac{1}{1-\alpha } \left( O(\mu ){\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + O(\mu )\xi ^2 + O(\mu ^2) \sigma _{q,p}^2 \right) . \end{aligned}$$

(79)
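The Jensen step used here and in the other appendices rests on the inequality $\Vert a + b\Vert ^2 \le \frac{1}{\alpha }\Vert a\Vert ^2 + \frac{1}{1-\alpha }\Vert b\Vert ^2$ for any $\alpha \in (0,1)$. A minimal numerical check is sketched below; the helper names `sq_norm` and `jensen_bound`, the dimension and the sampling ranges are assumptions made for illustration.

```python
import random

def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(x * x for x in v)

def jensen_bound(a, b, alpha):
    """Right-hand side of ||a + b||^2 <= ||a||^2/alpha + ||b||^2/(1 - alpha)."""
    return sq_norm(a) / alpha + sq_norm(b) / (1.0 - alpha)

random.seed(0)
worst_gap = float("inf")
for _ in range(1000):
    a = [random.gauss(0.0, 1.0) for _ in range(5)]
    b = [random.gauss(0.0, 1.0) for _ in range(5)]
    alpha = random.uniform(0.05, 0.95)
    lhs = sq_norm([x + y for x, y in zip(a, b)])
    # Smallest observed slack between the bound and the actual squared norm.
    worst_gap = min(worst_gap, jensen_bound(a, b, alpha) - lhs)

print(worst_gap)
```

The slack equals $\Vert (1-\alpha )a - \alpha b\Vert ^2 / (\alpha (1-\alpha ))$, so it is never negative; the free parameter $\alpha$ is what the proofs later tune to balance the two terms.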

Using the bound on the gradient noise (24), we get another ${\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2$ term, which can be bounded by the result in Theorem 4. Thus, we write:

$$\begin{aligned}&\frac{1}{1-\Vert H_{\epsilon }\Vert } {\mathbb {E}}\Vert \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + {\varvec{q}}_{p,i}\Vert ^2 + {\mathbb {E}}\Vert {\varvec{s}}_{p,i}\Vert ^2 \\&\quad \le \left( \frac{2\delta ^2 }{\alpha (1- \Vert H_{\epsilon }\Vert ) } +\beta _{s,p}^2 + \frac{O(\mu )}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )} \right) {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + \frac{ 2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\alpha (1-\Vert H_{\epsilon }\Vert )} \\&\qquad + \sigma _{s,p}^2 + \frac{O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,p}^2}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )} \\&\quad \le \left( \frac{2\delta ^2 }{\alpha (1- \Vert H_{\epsilon }\Vert )} +\beta _{s,p}^2 + \frac{O(\mu )}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )} \right) \Bigg ( \lambda _p^i A^i[p] \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} \\&\qquad + \sum _{j=0}^{i-1}\lambda _p^j A^j[p] \text{ col }\left\{ \mu ^2 \sigma _{s,p}^2+O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2+ \sigma _{g,p}^2\right\} \Bigg ) \\&\qquad + \frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\alpha (1-\Vert H_{\epsilon }\Vert ) }+ \sigma _{s,p}^2 + \frac{O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,p}^2}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )} . \end{aligned}$$

(80)

The noise term can be written more compactly as $ \Vert Q_{\epsilon } \otimes I \Vert ^2 \sum \limits _{p=1}^P \sigma _{g,p}^2.$ Thus, putting everything together, we get:

$$\begin{aligned}&{\mathbb {E}}\Vert (Q_{\epsilon } \otimes I) {\varvec{{ {\mathcal {W}}}}}_i\Vert ^2 \\&\quad \le \Vert H_{\epsilon }\Vert {\mathbb {E}} \Vert (Q_{\epsilon } \otimes I) {\varvec{{ {\mathcal {W}}}}}_{i-1}\Vert ^2 + \mu ^2 \Vert Q_{\epsilon } \otimes I \Vert ^2 \Vert H_{\epsilon }\Vert ^2 \sum _{p=1}^P \Bigg ( \left( \frac{2\delta ^2}{\alpha (1-\Vert H_{\epsilon }\Vert ) } \right. \\&\qquad \left. + \beta _{s,p}^2 + \frac{O(\mu )}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )}\right) \Bigg ( \lambda _p^i A^i[p] \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} + \sum _{j=0}^{i-1}\lambda _p^j A^j[p] \\&\qquad \times \text{ col }\left\{ \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2 \right\} \Bigg )+ \frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\alpha (1-\Vert H_{\epsilon }\Vert )} \\&\qquad + \sigma _{s,p}^2 + \frac{O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,p}^2}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )}\Bigg ) + \Vert Q_{\epsilon } \otimes I \Vert ^2 \sum _{p=1}^P \sigma _{g,p}^2 \\&\quad \le \Vert H_{\epsilon }\Vert ^i {\mathbb {E}} \Vert (Q_{\epsilon } \otimes I) {\varvec{{ {\mathcal {W}}}}}_{0}\Vert ^2 + \sum _{j'=0}^{i-1} \Vert H_{\epsilon }\Vert ^{j'+2} \Vert Q_{\epsilon } \otimes I \Vert ^2 \Bigg \{ \mu ^2 \sum _{p=1}^P \Bigg ( \left( \frac{2\delta ^2}{\alpha (1-\Vert H_{\epsilon }\Vert ) } \right. \\&\qquad \left. + \beta _{s,p}^2 + \frac{O(\mu )}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )}\right) \Bigg ( \lambda _p^{j'} A^{j'}[p] \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} + \sum _{j=0}^{j'-1}\lambda _p^j A^j[p] \\&\qquad \times \text{ col }\left\{ \mu ^2 \sigma _{s,p}^2 +O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2\right\} \Bigg ) + \frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\alpha (1-\Vert H_{\epsilon }\Vert )} \\&\qquad + \sigma _{s,p}^2 + \frac{O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,p}^2}{(1-\alpha )(1-\Vert H_{\epsilon }\Vert )}\Bigg ) + \frac{1}{\Vert H_{\epsilon }\Vert ^2 } \sum _{p=1}^P \sigma _{g,p}^2 \Bigg \}. \end{aligned}$$

(81)

Going back to the network disagreement, it is bounded by the above bound multiplied by $\Vert Q_{\epsilon }^\textsf{T}\otimes I\Vert ^2/P$. If we were to drive i to infinity, since $\Vert H_{\epsilon }\Vert = \iota _2< 1$, with $\iota _2$ being the second eigenvalue of A, and choosing $\alpha = \iota _2$ we would have:

$$\begin{aligned}&\limsup _{i\rightarrow \infty }\frac{1}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i} - {\varvec{w}}_{c,i}\Vert ^2 \\&\quad \le \frac{\Vert Q_{\epsilon } \otimes I \Vert ^4 \iota _2^2 }{P} \Bigg \{ \mu ^2 \sum _{p=1}^P \Bigg ( \left( \frac{2\delta ^2}{\iota _2(1- \iota _2)} + \beta _{s,p}^2 + \frac{O(\mu )}{(1-\iota _2)^2}\right) \sum _{j'=0}^{\infty }\iota _2^{j'}\sum _{j=0}^{j'-1} \lambda _p^jA^j[p] \\&\qquad \times \text{ col }\left\{ \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2+ \sigma _{g,p}^2\right\} + \frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\iota _2(1-\iota _2)^2 } \\&\qquad + \frac{\sigma _{s,p}^2 }{1-\iota _2}+ \frac{O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,p}^2}{(1-\iota _2)^3}\Bigg ) + \frac{1}{(1-\iota _2)\iota _2^2}\sum _{p=1}^P \sigma _{g,p}^2 \Bigg \} \\&\quad \le \frac{\iota _2^2}{P} \Bigg \{ \mu ^2 \sum _{p=1}^P \Bigg ( \left( \frac{2\delta ^2}{\iota _2(1- \iota _2)} + \beta _{s,p}^2 + \frac{O(\mu )}{(1-\iota _2)^2}\right) \\&\qquad \times \sum _{m\in {\mathcal {N}}_p} \frac{\iota _2(\mu ^2 \sigma _{s,m}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,m}^2+ \sigma _{g,m}^2) }{1-\iota _2\lambda _p a_{pm}} + \frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\iota _2 (1-\iota _2)^2 } \\&\qquad + \frac{\sigma _{s,p}^2 }{1-\iota _2}+ \frac{O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,p}^2}{(1-\iota _2)^3}\Bigg )+ \frac{1}{(1-\iota _2)\iota _2^2}\sum _{p=1}^P \sigma _{g,p}^2 \Bigg \} \\&\quad = \frac{\iota _2^2}{P(1-\iota _2)} \sum _{p=1}^P \mu ^2 \sigma _{s,p}^2 + \frac{1}{\iota _2^2}\sigma _{g,p}^2 + O(\mu )\sigma _{g,p}^2 + O(\mu ^3). \end{aligned}$$

(82)

Appendix 3: Proof of Theorem 1

First taking the conditional mean of the $\ell _2$-norm of the centroid error given the past models, splits the mean into three independent terms: the centralized recursion, the gradient noise and the added noise. Then, taking the expectation again, we get:

$$\begin{aligned}&{\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2 \\&\quad = {\mathbb {E}} \bigg \Vert {\widetilde{{\varvec{w}}}}_{c,i-1} + \mu \frac{1}{P}\sum _{p=1}^P \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + \mu {\varvec{q}}_i \bigg \Vert ^2 + \mu ^2 {\mathbb {E}} \Vert {\varvec{s}}_i\Vert ^2+ {\mathbb {E}} \Vert {\varvec{g}}_{c,i}\Vert ^2 \\&\quad {\mathop {\le }\limits ^{(a)}} \frac{1}{\alpha ^2}{\mathbb {E}} \bigg \Vert {\widetilde{{\varvec{w}}}}_{c,i-1} + \mu \frac{1}{P}\sum _{p=1}^P \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{c,i-1}) \bigg \Vert ^2 + \frac{\mu ^2}{1-\alpha } {\mathbb {E}}\Vert {\varvec{q}}_i \Vert ^2 \\&\qquad + \frac{\delta ^2\mu ^2 }{\alpha (1-\alpha )P}\sum _{p=1}^P {\mathbb {E}} \Vert {\varvec{w}}_{p,i-1} - {\varvec{w}}_{c,i-1}\Vert ^2 + \mu ^2{\mathbb {E}} \Vert {\varvec{s}}_i\Vert ^2+ {\mathbb {E}} \Vert {\varvec{g}}_{c,i}\Vert ^2 \\&\quad {\mathop {\le }\limits ^{(b)}} \left( \frac{1}{\alpha ^2}(1-2\nu \mu + \delta ^2 \mu ^2)+ \beta _s^2\mu ^2 + \frac{O(\mu ^3)}{1-\alpha } \right) {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{c,i-1}\Vert ^2 +\mu ^2 \sigma _s^2+ {\mathbb {E}}\Vert {\varvec{g}}_{c,i}\Vert ^2 \\&\qquad + \left( \frac{\delta ^2}{\alpha (1-\alpha )} +\frac{O(\mu ^3)}{1-\alpha }+ \beta _{s,max}^2 \right) \frac{\mu ^2}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1} - {\varvec{w}}_{c,i-1}\Vert ^2 \\&\qquad + \frac{O(\mu ^3)\xi ^2 + O(\mu ^4)\sigma _{q}^2}{1-\alpha }, \end{aligned}$$

(83)

where inequality (a) follows from Jensen’s inequality with constant $\alpha \in (0,1)$ and the Lipschitz condition, and (b) from applying Lemma 1. Then, choosing $\alpha = \root 4 \of {1-2\nu \mu + \delta ^2 \mu ^2} = 1 - O(\mu )$, the bound becomes:

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \lambda _c {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{c,i-1}\Vert ^2 + \mu ^2 \sigma _s^2 + {\mathbb {E}} \Vert {\varvec{g}}_{c,i}\Vert ^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _{q}^2 \\&\quad + \frac{O(\mu )}{P} \sum _{p=1}^P{\mathbb {E}}\Vert {\varvec{w}}_{p,i-1}-{\varvec{w}}_{c,i-1}\Vert ^2. \end{aligned}$$

(84)

Finally, using the result on the network disagreement, recursively bounding the error, and taking the limit as $i \rightarrow \infty$, we get the final result:

$$\begin{aligned} \limsup _{i \rightarrow \infty } {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \frac{\mu ^2 \sigma _s^2 + {\mathbb {E}}\Vert {\varvec{g}}_{c}\Vert ^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2}{1-\lambda _c}+ \sum _{p=1}^PO(1 ) \sigma _{g,p}^2 + O(\mu ). \end{aligned}$$

(85)

Appendix 4: Secondary result involving a bound on ${\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_{i}\Vert ^2$

To show the sensitivity of the algorithm is bounded with high probability, we require a bound on ${\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_{i}\Vert ^2$ and ${\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}'_{i}\Vert ^2$. From Theorem 4 we can bound the individual errors by:

$$\begin{aligned} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i}\Vert ^2&\le \lambda _p {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2 \\&\le \lambda _{\max } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2 \\&\le \lambda ^i_{\max } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2 + \frac{1-\lambda ^i_{\max }}{1-\lambda _{\max }} \left( \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2 \right) \\&\le \lambda ^i_{\max } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2 + O(\mu ) + O(\mu ^{-1}), \end{aligned}$$

(86)

where $\lambda _{\max } = \max _p \lambda _p$. Then, ${\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_i\Vert ^2$ can be bounded as follows:

$$\begin{aligned} {\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_i\Vert ^2&= \sum _{p=1}^P {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,i}\Vert ^2 \\&\le \sum _{p=1}^P \lambda ^i_{\max } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2 + O(\mu ) + O(\mu ^{-1}) \\&= \lambda _{\max }^i {\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_0\Vert ^2 + O(\mu ) + O(\mu ^{-1}). \end{aligned}$$

(87)

It follows that, for some constants B and $B'$, the probability that $\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_i\Vert$ and $\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}'_i\Vert$ exceed B and $B'$, respectively, can be bounded using Markov’s inequality:

$$\begin{aligned} {\mathbb {P}}(\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_i\Vert \ge B)&\le \frac{{\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_i\Vert ^2}{B^2} \\&\le \frac{\lambda _{\max }^i {\mathbb {E}}\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}_0\Vert ^2 + O(\mu ) + O(\mu ^{-1})}{B^2}, \end{aligned}$$

(88)

and similarly for ${\mathbb {P}}(\Vert {\widetilde{{\varvec{{ {\mathcal {W}}}}}}}'_i\Vert \ge B')$.
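The Markov step in (88) can be illustrated with a small Monte Carlo experiment. The sketch below is only a stand-in: the error norm is replaced by the absolute value of hypothetical Gaussian samples, and the threshold `B` is an arbitrary choice. It compares the empirical tail frequency with the second-moment bound ${\mathbb {E}}\Vert \cdot \Vert ^2 / B^2$.

```python
import random

random.seed(1)
# Hypothetical stand-in for the error norm: absolute values of Gaussian samples.
samples = [abs(random.gauss(0.0, 1.0)) for _ in range(10000)]
B = 2.5  # arbitrary threshold chosen for illustration

# Empirical second moment and the resulting Markov-type bound on the tail.
second_moment = sum(s * s for s in samples) / len(samples)
markov_bound = second_moment / B ** 2

# Fraction of samples whose norm is at least B.
empirical_tail = sum(1 for s in samples if s >= B) / len(samples)

print(empirical_tail, markov_bound)
```

The bound is loose for light-tailed samples such as these, which is consistent with its use in the proof: it only needs to show that the tail probability can be made small by picking a large enough threshold.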

Appendix 5: Proof of Theorem 3

To evaluate the probability distribution in Definition 1, we note that the randomness of the models ${\varvec{\psi }}_{p,j}$ arises from the subsampling of the data for the calculation of the stochastic gradient at each iteration. Thus, given the subsampled dataset, the models are now deterministic and since the added noises ${\varvec{g}}_{pm,j}$ are Laplacian random variables, the distribution of the added noise over the neighbourhood of agent p and over the iterations is given by:

$$\begin{aligned} f\left( \left\{ \left\{ {\varvec{\psi }}_{p,j} + {\varvec{g}}_{pm,j} \right\} _{m\in {\mathcal {N}}_p \setminus \{p\} } \right\} _{j=0}^i \right)&= f({\varvec{y}}_0) f({\varvec{y}}_1|{\varvec{y}}_0) \cdots f({\varvec{y}}_i|{\varvec{y}}_0,\cdots ,{\varvec{y}}_{i-1}) \\&=\prod _{j=0}^i \frac{1}{\sqrt{2}\sigma _g} \exp \bigg ( -\frac{\sqrt{2}}{\sigma _g} \Vert {\varvec{\psi }}_{p,j} + {\varvec{g}}_{p,j} \Vert \bigg ) \\&= \left( \frac{1}{\sqrt{2}\sigma _g}\right) ^{i+1} \exp \bigg ( -\frac{\sqrt{2}}{\sigma _g} \sum _{j=0}^i \Vert {\varvec{\psi }}_{p,j} + {\varvec{g}}_{p,j}\Vert \bigg ), \end{aligned}$$

(89)

where ${\varvec{y}}_{j} = \left\{ {\varvec{\psi }}_{p,j} + {\varvec{g}}_{pm,j} \right\} _{m\in {\mathcal {N}}_p {\setminus } \{p\} }$ and the ratio in Definition 1 is bounded with high probability:

$$\begin{aligned} &\exp \left(-\frac{\sqrt{2}}{\sigma _g} \sum_{j=0}^i \Big( \Vert {\varvec{\psi }}_{p,j} + {\varvec{g}}_{p,j}\Vert - \Vert {\varvec{\psi }}'_{p,j} + {\varvec{g}}_{p,j}\Vert \Big) \right)\\ &\quad \le \exp \left( \frac{\sqrt{2}}{\sigma _g} \sum _{j=0}^i \Vert {\varvec{\psi }}_{p,j} - {\varvec{\psi }}'_{p,j} \Vert \right)\\ &\quad \le \exp \left(\frac{\sqrt{2}}{\sigma _g} \sum _{j=0}^i \Delta (j) \right) \\ &\quad \le \exp \left( \frac{\sqrt{2}}{\sigma _g} \sum _{j=0}^i (B + B' + \Vert w^o - w^{{\prime}o}\Vert ) \right)\\ &\quad = \exp \left(\frac{\sqrt{2}}{\sigma _g } (B + B' + \Vert w^o - w^{{\prime}o}\Vert)(i+1) \right), \end{aligned}$$

(90)

where the inequalities follow from the triangle inequality and the bound on the sensitivity of the algorithm.
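To make the exponent in (90) concrete, the sketch below evaluates it as a function of the iteration index and checks the single-step density-ratio bound for a scalar Laplace variable with variance $\sigma _g^2$. It is an illustration under simplifying assumptions: the models are scalars, and the function names `laplace_pdf` and `epsilon` as well as all numerical values are hypothetical.

```python
import math

def laplace_pdf(x, sigma):
    """Scalar zero-mean Laplace density with variance sigma^2 (scale sigma/sqrt(2))."""
    b = sigma / math.sqrt(2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)

def epsilon(i, sigma_g, B, B_prime, d):
    """Exponent of the bound in (90) after iterations 0..i, with d = ||w^o - w'^o||."""
    return math.sqrt(2.0) / sigma_g * (B + B_prime + d) * (i + 1)

# Single-step check: the ratio of the densities of y = psi + g and y = psi' + g
# never exceeds exp(sqrt(2) |psi - psi'| / sigma_g), by the triangle inequality.
sigma_g = 0.5
psi, psi_prime = 0.3, -0.1
worst_ratio = max(
    laplace_pdf(y - psi, sigma_g) / laplace_pdf(y - psi_prime, sigma_g)
    for y in [k / 10.0 for k in range(-50, 51)]
)
single_step_bound = math.exp(math.sqrt(2.0) * abs(psi - psi_prime) / sigma_g)

print(worst_ratio, single_step_bound, epsilon(9, 1.0, 0.5, 0.5, 0.0))
```

The exponent grows linearly with the number of iterations and inversely with $\sigma _g$, so a longer run requires a proportionally larger noise level to keep the same privacy level.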

Appendix 6: Proof of Theorem 2

We start by writing the error recursion:

$$\begin{aligned} {\widetilde{{\varvec{w}}}}_{1,i}&= {\widetilde{{\varvec{w}}}}_{1,i-1} + \frac{\mu }{K}\sum _{k=1}^K \nabla _{w^{\textsf{T}}} J_{1,k}({\varvec{w}}_{1,i-1}) + \mu {\varvec{s}}_{1,i} + \mu {\varvec{q}}_{1,i} + \frac{\mu }{L}\sum _{k \in {\mathcal {L}}_{1,i}} {\varvec{g}}_{1,k,i}, \end{aligned}$$

(91)

where we introduce the gradient noise ${\varvec{s}}_{1,i}$ and the incremental noise ${\varvec{q}}_{1,i}$:

$$\begin{aligned} {\varvec{s}}_{1,i}&= \frac{1}{L}\sum _{k\in {\mathcal {L}}_{1,i}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,i-1}) - \frac{1}{K}\sum _{k=1}^K \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,i-1}), \end{aligned}$$

(92)

$$\begin{aligned} {\varvec{q}}_{1,i}&= \frac{1}{L}\sum _{k\in {\mathcal {L}}_{1,i}} \frac{1}{E_{1,k}}\sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}) - \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,i-1}). \end{aligned}$$

(93)

We have already shown in previous work that the gradient noise is zero-mean and has a bounded second-order moment [28, Lemma 1], while the incremental noise has a bounded second-order moment [28, Lemma 5]:

$$\begin{aligned} {\mathbb {E}} \{\Vert {\varvec{s}}_{1,i}\Vert ^2 | {\mathcal {F}}_{i-1}\}&\le \beta _{s,1}^2 \Vert {\widetilde{{\varvec{w}}}}_{1,i-1}\Vert ^2 + \sigma _{s,1}^2, \end{aligned}$$

(94)

$$\begin{aligned} {\mathbb {E}} \Vert {\varvec{q}}_{1,i}\Vert ^2&\le O(\mu ){\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i-1}\Vert ^2 + O(\mu )\xi ^2 + O(\mu ^2)\sigma _{q,1}^2, \end{aligned}$$

(95)

where the constants $\beta _{s,1}^2, \sigma _{s,1}^2, \sigma _{q,1}^2$ are given by:

$$\begin{aligned} \beta _{s,1}^2&= \frac{6\delta ^2}{L}\left( 1 + \frac{1}{K}\sum _{k=1}^K \frac{1}{E_{1,k}} \right) , \end{aligned}$$

(96)

$$\begin{aligned} \sigma _{s,1}^2&= \frac{1}{LK}\sum _{k=1}^K \left( \frac{12}{E_{1,k}} + 3\right) \frac{1}{N_{1,k}}\sum _{n=1}^{N_{1,k}} \Vert \nabla _{w^{\textsf{T}}} Q_{1,k}(w^o;x_{1,k,n})\Vert ^2, \end{aligned}$$

(97)

$$\begin{aligned} \sigma _{q,1}^2&= \frac{3}{K}\sum _{k=1}^K \sum _{n=1}^{N_{1,k}}\Vert \nabla _{w^{\textsf{T}}} Q_{1,k}(w^o_{1,k};x_{1,k,n})\Vert ^2. \end{aligned}$$

(98)

Taking the conditional mean of the $\ell _2$-norm of the error, we can split the noise term from the rest and then apply Jensen’s inequality with some constant $\alpha \in (0,1)$:

$$\begin{aligned} {\mathbb {E}} \{ \Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2 | {\mathcal {F}}_{i-1}, {\mathcal {L}}_{1,i} \}&= {\mathbb {E}} \Bigg \{ \Bigg \Vert {\widetilde{{\varvec{w}}}}_{1,i-1} + \frac{\mu }{K}\sum _{k=1}^K \nabla _{w^{\textsf{T}}} J_{1,k}({\varvec{w}}_{1,i-1}) + \mu {\varvec{s}}_{1,i} \\&\quad + \mu {\varvec{q}}_{1,i} \Bigg \Vert ^2 \Bigg | {\mathcal {F}}_{i-1}, {\mathcal {L}}_{1,i} \Bigg \} + \frac{\mu ^2}{L^2}\sum _{k \in {\mathcal {L}}_{1,i}}{\mathbb {E}}\Vert {\varvec{g}}_{1,k,i}\Vert ^2 \\&\le \frac{1}{\alpha } \left\| {\widetilde{{\varvec{w}}}}_{1,i-1} + \frac{\mu }{K}\sum _{k=1}^K \nabla _{w^{\textsf{T}}} J_{1,k}({\varvec{w}}_{1,i-1})\right\| ^2 \\&\quad + \frac{\mu ^2}{\alpha } {\mathbb {E}} \{\Vert {\varvec{s}}_{1,i}\Vert ^2 | {\mathcal {F}}_{i-1}, {\mathcal {L}}_{1,i}\} + \frac{\mu ^2}{L}\sigma _{g,1}^2 \\&\quad + \frac{\mu ^2}{1-\alpha } {\mathbb {E}} \{\Vert {\varvec{q}}_{1,i}\Vert ^2 | {\mathcal {F}}_{i-1}, {\mathcal {L}}_{1,i}\} . \end{aligned}$$

(99)

Using strong convexity and Lipschitz continuity of the functions we can bound the first term as:

$$\begin{aligned} \left\| {\widetilde{{\varvec{w}}}}_{1,i-1}+\frac{\mu }{K}\sum _{k=1}^K \nabla _{w^{\textsf{T}}} J_{1,k}({\varvec{w}}_{1,i-1}) \right\| ^2&\le (1-2\nu \mu + \delta ^2\mu ^2)\Vert {\widetilde{{\varvec{w}}}}_{1,i-1}\Vert ^2. \end{aligned}$$

(100)

Then, taking the expectation again over the past models and the selected agents, and using the bounds on the gradient noise and the incremental noise, we get:

$$\begin{aligned} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2&\le \left( \frac{1-2\nu \mu + (\beta _{s,1}^2 +\delta ^2)\mu ^2}{\alpha } + \frac{O(\mu ^3)}{1-\alpha }\right) {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i-1} \Vert ^2 + \frac{\mu ^2}{\alpha } \sigma _{s,1}^2 \\&\quad + \frac{O(\mu ^3)\xi ^2 + O(\mu ^4)\sigma _{q,1}^2}{1-\alpha } + \frac{\mu ^2}{L}\sigma _{g,1}^2. \end{aligned}$$

(101)

Then, recursively bounding the error with $\alpha = \sqrt{1-2\nu \mu + (\beta _{s,1}^2+\delta ^2)\mu ^2}$:

$$\begin{aligned} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2&\le \lambda ^i{\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,0}\Vert ^2 + \frac{1-\lambda ^i}{1-\lambda } \bigg ( O(\mu ^2)\sigma _{s,1}^2 + O(\mu ^2)\xi ^2 + \frac{\mu ^2}{L}\sigma _{g,1}^2 + O(\mu ^3)\sigma _{q,1}^2 \bigg ), \end{aligned}$$

(102)

and taking the limit as $i \rightarrow \infty$:

$$\begin{aligned} \limsup _{i\rightarrow \infty } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2 \le&O(\mu ) (\sigma _{s,1}^2 + \xi ^2 + \sigma _{g,1}^2) + O(\mu ^2)\sigma _{q,1}^2. \end{aligned}$$

(103)
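The passage from (102) to (103) can be traced numerically on a scalar recursion $e_i = \lambda e_{i-1} + c$. The sketch below is an illustration under stated assumptions: it keeps only the leading-order factor $\lambda = \sqrt{1-2\nu \mu + (\beta _{s,1}^2+\delta ^2)\mu ^2}$, sets the driving term to $c = \mu ^2 \sigma ^2$, and uses hypothetical constants; the function names `contraction`, `closed_form` and `iterate` are introduced here.

```python
import math

def contraction(nu, delta, beta_s, mu):
    """Leading-order lambda = sqrt(1 - 2 nu mu + (beta_s^2 + delta^2) mu^2)."""
    return math.sqrt(1.0 - 2.0 * nu * mu + (beta_s ** 2 + delta ** 2) * mu ** 2)

def closed_form(lam, e0, c, i):
    """lambda^i e0 + (1 - lambda^i)/(1 - lambda) * c, the form used in (102)."""
    return lam ** i * e0 + (1.0 - lam ** i) / (1.0 - lam) * c

def iterate(lam, e0, c, i):
    """Run e <- lam * e + c for i steps starting from e0."""
    e = e0
    for _ in range(i):
        e = lam * e + c
    return e

nu, delta, beta_s, sigma2 = 1.0, 2.0, 0.5, 1.0   # hypothetical constants
steady = {}
for mu in (1e-2, 5e-3):
    lam = contraction(nu, delta, beta_s, mu)
    c = mu ** 2 * sigma2            # driving term of order mu^2
    steady[mu] = c / (1.0 - lam)    # limit of the closed form as i grows
    print(mu, lam, closed_form(lam, 1.0, c, 5000), steady[mu])
```

Since $1-\lambda$ is of order $\mu$, dividing a driving term of order $\mu ^2$ by it leaves a steady-state level of order $\mu$: halving the step size roughly halves the limiting value in this toy setting.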

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Rizk, E., Vlaski, S. & Sayed, A.H. Privatized graph federated learning. EURASIP J. Adv. Signal Process. 2023, 87 (2023). https://doi.org/10.1186/s13634-023-01049-4

Received: 16 January 2023
Accepted: 15 August 2023
Published: 25 August 2023
DOI: https://doi.org/10.1186/s13634-023-01049-4
