Privatized graph federated learning

Federated learning is a semi-distributed algorithm, where a server communicates with multiple dispersed clients to learn a global model. The federated architecture is not robust and is sensitive to communication and computational overloads due to its one-master multi-client structure. It can also be subject to privacy attacks targeting personal information on the communication links. In this work, we introduce graph federated learning, which consists of multiple federated units connected by a graph. We then show how graph-homomorphic perturbations can be used to ensure the algorithm is differentially private at the server level. At the client level, we show that the differentially private federated learning algorithm can be improved by adding random noise to the updates, as opposed to the models. We conduct both convergence and privacy theoretical analyses and illustrate performance by means of computer simulations.


Introduction
Federated learning (FL) [1] is one particular distributed structure where users no longer need to send their data to a server for training. Instead, data remains local, and training happens in collaboration between the different clients and the server. Compared to a fully decentralized solution, communication occurs between the server and the clients (or agents), instead of directly between the agents themselves. Such a solution is advantageous in the sense that users no longer need to worry about sharing their data with an unknown party, and the high cost of sending all their raw data is eliminated. In this way, the data stays safely on a user's device, and no extra communication cost is incurred for transferring the data remotely. However, such a distributed architecture is not robust to communication failures and computational overloads, nor is it immune to privacy attacks when agents are required to share their local updates. In standard FL, millions of users can be connected to one server at a time. This means one server is responsible for the communication with all clients, with a significant computational burden, thus rendering the system susceptible to communication failures. Furthermore, whether clients send their gradient updates or their local models, information about their data can be inferred from the exchanges and leaked [2–5]. Consider, for instance, the logistic risk; the gradient of the loss function is a constant multiple of the feature vector. Thus, even though the actual data samples are not sent to the server, information about them can still be inferred from the gradient updates or the models.
These considerations motivate us to propose an architecture for federated learning with privacy guarantees. In particular, we introduce the graph federated architecture, which consists of multiple servers, and we privatize the algorithm by ensuring that the communication occurring between the servers and the clients is secure. Graph-homomorphic perturbations, which were initially introduced in [6], focus on the communication between servers. They are based on adding correlated noise to the messages sent between servers, such that the noise cancels out if we were to take the average of all messages across all servers. As for the privatization between the clients and their servers, we share noisy updates as opposed to noisy models. The two protocols make sure the effect of the added noise is reduced.
Other works have also contributed to addressing the challenges we consider in this work, albeit differently. For example, the work [7] introduces a hierarchical architecture, where it is assumed there are multiple servers connected in a tree structure. Such a solution still has one main server and thus faces the same robustness problem as FL. The graph federated learning architecture in this work (which appeared in the earlier conference publication [8]) is a more general structure. The work [9] generalizes the standard distributed learning framework to include local updates. While [10] has a similar architecture to the GFL architecture proposed earlier in [8], it does not deal with privacy, and it employs different objective functions and a different learning algorithm based on the alternating direction method of multipliers. Likewise, many solutions exist that relate to privacy issues. These methods may be split into two subgroups: those using random perturbations to ensure a certain level of differential privacy [11–20], and those that rely on cryptographic methods [21–25]. Both have their advantages and disadvantages. While differential privacy is easy to implement, it hinders the performance of the algorithm by reducing the model utility. As for cryptographic methods, they are generally harder to implement, since they require more computational and communication power [26, 27]. Furthermore, they restrict the number of participating users. Moving forward, we focus on differentially private methods.
The main contribution of this work is three-fold. We introduce a new, generalized, and more realistic architecture for the federated setting, where we now consider multiple servers connected by some graph structure. Furthermore, many earlier works have proposed adding Laplacian noise sources to the shared information among agents in order to ensure some level of privacy. However, these works have largely ignored the fact that these noises degrade the mean-square-error (MSE) performance of the network from O(µ) down to O(µ⁻¹), where µ is the small learning parameter. To resolve this issue, we define a new noise generation scheme that maintains the MSE at O(1) while ensuring privacy. Although the work [20] proposed a noisy-distributed consensus strategy, this reference lacks a useful construction method for the perturbations. In this work, we devise a construction scheme. Therefore, the main difference between our proposed method and previous works is that we devise a noise construction scheme that ensures the total sum of the added noise cancels out centrally. This results in the improved MSE bound of O(1). Finally, we prove that clients sharing noisy updates, as opposed to noisy models, leads to improved performance relative to what is commonly done in the prior literature. Moreover, we do not assume bounded gradients, as is commonly assumed in previous works [12, 15, 16], since this condition does not actually hold in most situations in practice. Note, for instance, that even quadratic risks do not have bounded gradients. For this reason, we will not rely on this condition, and will instead be able to show that our noise construction ensures differential privacy with high probability for most cases of interest. The main results shown in this work are as follows:

1. Privatized GFL under graph-homomorphic perturbations converges in the MSE sense to an O(1) neighbourhood of the true model w^o, as opposed to O(µ⁻¹) when random perturbations are used instead.
2. Privatized FL under perturbed gradients converges in the MSE sense to an O(µ) neighbourhood of the true model w^o, as opposed to O(µ⁻¹) when perturbed models are shared instead.
3. GFL with graph-homomorphic perturbations and perturbed gradients is ǫ(i)-differentially private with high probability.

Graph federated architecture
In the graph federated architecture, which we initially introduced in [8], we consider P federated units connected by a graph structure. Each federated unit consists of a server and a set of K agents. Thus, the overall architecture can be represented as the graph depicted in Fig. 1. We denote the combination matrix connecting the servers by A ∈ R^{P×P}, and we write a_{pm} to refer to the elements of A. We assume each agent of every server has its own dataset {x_{p,k,n}}_{n=1}^{N_{p,k}} that is non-iid when compared to the other agents. The subscript p refers to the federated unit, k to the agent, and n to the data sample. We note the difference between our proposed architecture and a fully distributed setting: the graph federated architecture consists of a network of federated units (Fig. 1 shows the graph federated learning architecture), while a fully distributed network removes the need for servers and assumes clients are connected to each other based on some graph structure. Such an architecture is an improvement on the original federated architecture and not necessarily on the fully distributed architecture. Instead of clients communicating with the same server, we split the load among multiple servers.
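As a small illustration of the server-level graph, the following sketch (our own; the ring topology, helper names, and weight rule are assumptions, not taken from the paper) builds a symmetric, doubly-stochastic combination matrix A using Metropolis-Hastings weights:

```python
# Sketch: build a symmetric, doubly-stochastic combination matrix A for the
# server graph using Metropolis-Hastings weights. The ring topology and the
# helper names are illustrative assumptions.

def metropolis_weights(adjacency):
    """adjacency[p][m] is True if servers p and m are neighbours (p != m)."""
    P = len(adjacency)
    degree = [sum(adjacency[p]) for p in range(P)]
    A = [[0.0] * P for _ in range(P)]
    for p in range(P):
        for m in range(P):
            if p != m and adjacency[p][m]:
                A[p][m] = 1.0 / (1 + max(degree[p], degree[m]))
        A[p][p] = 1.0 - sum(A[p])   # self-weight absorbs the remainder
    return A

# Example: P = 4 servers connected in a ring.
P = 4
adj = [[abs(p - m) in (1, P - 1) for m in range(P)] for p in range(P)]
A = metropolis_weights(adj)

# A is symmetric and doubly stochastic: every row and column sums to one.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)
assert all(abs(sum(A[p][m] for p in range(P)) - 1.0) < 1e-12 for m in range(P))
```

Any symmetric, doubly-stochastic choice satisfying Assumption 1 further below would do; Metropolis weights are just one common construction.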
With this architecture, we associate a convex optimization problem that takes into account the cost function at each federated unit. Thus, the optimization goal is to find the optimal global model w^o that minimizes an average empirical risk (1), where each individual cost is an empirical risk defined over the local loss functions. To solve problem (1), each federated unit p runs the standard federated averaging (FedAvg) algorithm [1]. An iteration i of the algorithm consists of the server p selecting a subset of L participating agents L_{p,i}. Then, in parallel, each agent runs a series of stochastic gradient descent (SGD) steps. We call these local steps epochs, and denote an epoch by the letter e and the total number of epochs by E_{p,k}. The sampled data point at an agent k in the federated unit p during the e-th epoch of iteration i is denoted by b. Thus, during an iteration i, each participating agent k ∈ L_{p,i} updates the last model w_{p,i−1} and sends its new model w_{p,k,E_{p,k}} to the server after E_{p,k} epochs. During a single epoch e, the agent updates its current local model w_{p,k,e−1} by running a single SGD step. Thus, an agent repeats the adaptation step for e = 1, 2, . . ., E_{p,k}, with x_{p,k,b} being the sampled data of agent k in federated unit p, and w_{p,k,0} = w_{p,i−1}. After all the participating agents k ∈ L_{p,i} run all their epochs, the server aggregates their final models w_{p,k,E_{p,k}}, which we rename as w_{p,k,i} since each is the final local model at iteration i. Next, at the server level, these estimates are combined across neighbourhoods using a diffusion-type strategy, where we consider the previous steps (3) and (4) as the adaptation step and the following as the combination step:

w_{p,i} = Σ_{m∈N_p} a_{pm} ψ_{m,i}.  (5)

To introduce privacy, the models communicated at each round between the agents and the servers need to be encrypted in some way. We could either apply secure multiparty computation (SMC) tools, like secret sharing, or use differential privacy. We focus on differential privacy, i.e., masking tools that can be represented by added noise. Thus, we let agent 1 in federated unit 2 add a noise component g_{2,1,i} to its final model w_{2,1,i} at iteration i, and then let server 2 add g_{12,i} to the message ψ_{2,i} it sends to server 1. More generally, we denote by g_{pm,i} the noise added to the message sent by server m to server p at iteration i. Similarly, we denote by g_{p,k,i} the noise added to the model sent by agent k to server p during the i-th iteration. We use unseparated subscripts pm for the inter-server noise components to point out their ability to be combined into a matrix structure. In contrast, the agent-server noise components' subscripts are separated by a comma to highlight a hierarchical structure. Thus, the privatized algorithm can be written as a client update step (6), a server aggregation step (7), and a server combination step (8). The client update step (6) follows from (3) by combining the multiple epochs for e = 1, 2, . . ., E_{p,k} into one update step, with w_{p,k,i} = w_{p,k,E_{p,k}} and w_{p,k,0} = w_{p,i−1}.
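To make the recursion concrete, here is a minimal sketch of one privatized iteration of the client update, server aggregation, and server combination steps (6)–(8) on scalar quadratic losses. All names (`laplace`, `local_update`, the toy data and constants) are our own, and the noise here is generic independent Laplace noise, not yet the graph-homomorphic scheme discussed later:

```python
import math
import random

random.seed(0)

def laplace(scale):
    # Inverse-CDF sampling of a zero-mean Laplace variable with the given scale.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def local_update(w_start, data, mu, epochs):
    # E local SGD epochs on the toy quadratic loss q(w; x) = (x - w)^2 / 2,
    # whose stochastic gradient is (w - x).
    w = w_start
    for _ in range(epochs):
        x = random.choice(data)
        w = w - mu * (w - x)
    return w

# P = 2 federated units, K = 3 clients each, scalar models for readability.
A = [[0.5, 0.5], [0.5, 0.5]]           # combination matrix between servers
data = [[[1.0, 1.2], [0.9, 1.1], [1.3, 0.8]],
        [[2.0, 2.1], [1.9, 2.2], [2.05, 1.95]]]
w_prev = [0.0, 0.0]                    # w_{p,i-1}
mu, sigma_g = 0.1, 0.05

# Steps (6)-(7): clients update locally, add noise g_{p,k,i}; server averages.
psi = []
for p in range(2):
    noisy_models = [local_update(w_prev[p], data[p][k], mu, epochs=5)
                    + laplace(sigma_g) for k in range(3)]
    psi.append(sum(noisy_models) / 3)

# Step (8): servers combine neighbours' estimates, with noise g_{pm,i} per link.
w_new = [sum(A[p][m] * (psi[m] + laplace(sigma_g)) for m in range(2))
         for p in range(2)]
```

The sketch keeps every client participating and uses a single data sample per epoch; the paper's algorithm additionally subsamples L agents and uses mini-batches.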

Performance analysis
In this section, we show a list of results on the performance of the algorithm. We study the convergence of the privatized algorithm (6)–(8) and examine the effect of privatization on performance.

Modeling conditions
To proceed with our analysis, we require certain reasonable assumptions on the graph structure and the cost functions.
Assumption 1 (Combination matrix) The combination matrix A describing the graph is symmetric and doubly stochastic. Furthermore, the graph is strongly connected.

Assumption 2 (Convexity and smoothness) The empirical risks J_{p,k}(·) are ν-strongly convex for some ν > 0, and the loss functions Q_{p,k}(·; ·) are convex. Furthermore, the loss functions have δ-Lipschitz continuous gradients, meaning there exists δ > 0 bounding the gradient variation for any data point x_{p,n}.

We also require a bound on the difference between the global optimal model w^o and the local optimal models w^o_{p,k} that minimize J_{p,k}(·). This assumption is used to bound the gradient noise and the incremental noise defined further ahead. It is not a restrictive assumption, and it imposes a condition on when collaboration among different agents is sensible. In other words, since the agents have non-iid data, sometimes their optimal models are too different, and collaboration would hurt their individual performance. For example, when considering recommender systems, people in the same country are more likely to be recommended the same movie than people in different countries. This means people of the same country might have different, but relatively close, models, contrary to people in different countries.

Assumption 3 (Model drifts)
The distance of each local model w^o_{p,k} to the global model w^o is uniformly bounded, i.e., there exists ξ ≥ 0 such that ‖w^o − w^o_{p,k}‖ ≤ ξ.

Network centroid convergence
We study the convergence of the algorithm from the perspective of the network centroid w_{c,i}, defined as the average of the server models, for which we write the central recursion (10). Next, we define the model error as w̃_{c,i} = w^o − w_{c,i} and the average gradient noise s_i, built from the per-unit gradient noises s_{p,i}. We also introduce the average incremental noise q_i and the local incremental noise q_{p,i}, which capture the error introduced by the multiple local update steps. We then arrive at an error recursion driven by g_i, the total added noise at iteration i. We estimate the first and second-order moments of the gradient noise in the following lemma. To do so, we use the fact, shown in previous work (Lemma 1 in [28]), that the individual gradient noise is zero-mean with a bounded second-order moment σ²_{s,p}, where F_{i−1} is the filtration defined over the randomness introduced by all the past subsampling of the data for the calculation of the stochastic gradient. Using Assumption 3, we can guarantee that σ²_{s,p} is bounded.

Lemma 1 (Estimation of first and second-order moments of the gradient noise) The gradient noise defined in (17) is zero-mean and has a bounded second-order moment, with constants β²_s and σ²_s.

Proof The result follows from applying Jensen's inequality and the bounds on the per-unit gradient noise s_{p,i}.

The new term found in the bound of the gradient noise is what we call the network disagreement. It captures the difference between the path taken by the individual models and that of the network centroid. We bound this difference in Lemma 3. However, before doing so, we show that the second-order moment of the incremental noise is on the order of O(µ). From Lemma 5 in [28], we can bound the individual incremental noise in terms of constants a, b_k, and c_k. The following result follows.
Lemma 2 (Estimation of the second-order moment of the incremental noise) The incremental noise defined in (20) has a bounded second-order moment, where the constant σ²_q is the average of the σ²_{q,p,k}.

Proof The result follows from applying Jensen's inequality and the bounds on the per-unit incremental noise q_{p,i}. Furthermore, a = O(µ⁻¹), b_k = O(µ⁻¹), and c_k = O(1) reduce the expression to (37).

We now bound the network disagreement. To do so, we first introduce the eigendecomposition A = QHQ^T, where H_θ is a diagonal matrix that includes the last (P − 1) eigenvalues of A and Q_θ their corresponding eigenvectors.

Lemma 3 (Network disagreement)
The average deviation from the centroid is bounded during each iteration i by (37), where W_0 = col{w_{p,0}}_{p=1}^P and the convergence rate lies in (0, 1); the bound also holds in the limit. Thus, from the above lemma, we see that the individual models gravitate to the centroid model, with an error introduced by the added privatization. The effect of the added noise overpowers that of the gradient and incremental noise, since the latter is on the order of the step-size.
Then, using the above result, we can establish the convergence of the centroid model to a neighbourhood of the true optimal model w^o in the mean-square-error (MSE) sense.
Theorem 1 (Centroid MSE convergence) Under Assumptions 1, 2 and 3, the network centroid converges to a neighbourhood of the optimal point w^o exponentially fast for a sufficiently small step-size µ; letting i tend to infinity yields the limiting bound (40).

Proof See "Appendix 3".
The main term in the above bound is the variance of the added noise, with a dominating factor of µ⁻¹, which allows us to rewrite the bound in terms of E‖g‖², the variance of the total added noise, which is independent of time. While in general decreasing the step-size improves performance, the above result shows that this need not be the case with privatization. Thus, since the added noise impacts the model utility negatively, it is important to choose a privatization scheme that reduces this effect. In what follows, we look closely at such a scheme.

Graph-homomorphic perturbations
We consider a specific privatization scheme and specialize the above results. The goal of the scheme is to remove the O(µ⁻¹) term from the MSE bound. Thus, focusing on the centroid model expression (16), we wish to cancel out the total added noise amongst the servers, i.e., to enforce the condition

Σ_{p,m=1}^P a_{pm} g_{pm,i} = 0.  (47)

To achieve this, we introduce graph-homomorphic perturbations, defined as follows [6]. We assume each server p draws a sample g_{p,i} independently from the Laplace distribution Lap(0, σ_g/√2) with variance σ²_g. Server p then sets the noise g_{mp,i} added to the message sent to its neighbour m as a function of g_{p,i}, in such a way that condition (47) is satisfied. With such a scheme, the noise components proportional to O(µ⁻¹) resulting from the noise added between the servers cancel out in the error recursions. However, since the gradients are evaluated at the local models w_{p,i} and not at the centroid w_{c,i}, the effect of the noise is still evident. Yet, this remaining error introduced by the noise is controlled by the step-size, and its effect can thus be mitigated by using a smaller step-size.
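The paper's explicit noise-assignment formula did not survive extraction. The sketch below uses one construction consistent with [6] as we understand it: each server adds the same sample g_{p,i} to all outgoing messages and a compensating self-noise scaled by −(1 − a_pp)/a_pp (this coefficient is our reconstruction, an assumption), and it verifies numerically that condition (47) holds:

```python
import math
import random

random.seed(1)

def laplace(scale):
    # Inverse-CDF sampling of a zero-mean Laplace variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

P = 4
# Doubly-stochastic A for a ring of 4 servers with self-loops (uniform 1/3).
A = [[1.0 / 3.0 if (abs(p - m) in (1, 3) or p == m) else 0.0
      for m in range(P)] for p in range(P)]

sigma_g = 1.0
g_local = [laplace(sigma_g / math.sqrt(2)) for _ in range(P)]  # one g_{p,i} each

# g[p][m]: noise added to the message sent BY server m TO server p.
g = [[0.0] * P for _ in range(P)]
for m in range(P):
    for p in range(P):
        if p != m:
            g[p][m] = g_local[m]                       # same noise to all peers
    g[m][m] = -(1.0 - A[m][m]) / A[m][m] * g_local[m]  # compensating self-noise

# Weighted noise cancels centrally: sum_{p,m} a_pm g_pm,i = 0 (condition (47)).
total = sum(A[p][m] * g[p][m] for p in range(P) for m in range(P))
assert abs(total) < 1e-12
```

The cancellation works column by column: since A is doubly stochastic, the neighbours' weights for server m sum to 1 − a_mm, which is exactly offset by the self-noise term.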
In the next corollary, we show that if no noise is added amongst the clients and graph-homomorphic perturbations are used amongst the servers, then the error converges to an O(1)σ²_g neighbourhood.
Corollary 1 (Centroid MSE convergence under graph-homomorphic perturbations) Under Assumptions 1, 2 and 3, the network centroid with graph-homomorphic perturbations converges to a neighbourhood of the optimal point w^o exponentially fast for a sufficiently small step-size µ; letting i tend to infinity gives the limiting bound.

Proof Starting from (43), and setting E‖g_i‖² = 0 since the total noise g_i = 0 under graph-homomorphic perturbations, we get the final result.

Sharing gradients as opposed to weight estimates
We next show that sharing gradients rather than models is better for performance under added noise. In the remainder of this section, and for the sake of simplicity, we illustrate this conclusion by considering one federated unit, say p = 1. If we were to introduce differential privacy to federated learning, a random Laplacian noise would be added to each model by the client before aggregation by the server, yielding a new privatized aggregation step. However, if we were to study the MSE convergence of this privatized algorithm, we would notice a new O(µ⁻¹)σ²_g term in the bound (Theorem 1). To address this degradation, we now describe an alternative implementation that shares gradients as opposed to weight estimates. Note first that the FL algorithm can be expressed in a single step taken from the server's perspective. This suggests that instead of every agent sharing its final model w_{1,k,i}, they could share the total update. The server then aggregates the updates from all participating agents and updates the previous model w_{1,i−1}. In this case, if we were to privatize this new version of the algorithm, we would add random noise to the updates, which are then scaled by the step-size. We show in the following theorem the effect of the added noise on the new FL algorithm. It turns out that the noise introduces an O(µ) error instead of O(µ⁻¹).
Thus, sharing the updates instead of the models is advantageous, since the effect of the added noise on the performance is reduced. The O(µ) factor allows us to increase the value of the noise variance while ensuring the model utility does not deteriorate significantly. Therefore, to guarantee an ǫ(i)-DP algorithm, we let the added noise be a zero-mean Laplacian random variable with variance σ²_g.
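A quick simulation illustrates the gap between the two privatization points. The setup below (scalar quadratic risk, one epoch per client, the particular constants) is our own toy construction, not the paper's experiment; it only demonstrates that noise on updates is attenuated by µ while noise on models is not:

```python
import math
import random

random.seed(2)

def laplace(scale):
    # Inverse-CDF sampling of a zero-mean Laplace variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Single-unit FL on the scalar quadratic J(w) = (w - 1)^2 / 2, minimiser w* = 1.
K, mu, sigma_g, iters = 20, 0.05, 1.0, 4000

def run(share_updates):
    w, sq_err, count = 0.0, 0.0, 0
    for i in range(iters):
        if share_updates:
            # Clients send noisy gradient updates; the noise is scaled by mu.
            step = sum((w - 1.0) + laplace(sigma_g) for _ in range(K)) / K
            w = w - mu * step
        else:
            # Clients send noisy models; the noise enters unscaled.
            w = sum((w - mu * (w - 1.0)) + laplace(sigma_g)
                    for _ in range(K)) / K
        if i >= iters // 2:            # average squared error in steady state
            sq_err += (w - 1.0) ** 2
            count += 1
    return sq_err / count

mse_models = run(share_updates=False)
mse_updates = run(share_updates=True)
assert mse_updates < mse_models        # perturbed updates dominate
```

In this toy run the steady-state MSE of the model-sharing variant exceeds that of the update-sharing variant by roughly a factor of 1/µ², consistent with the O(µ) versus O(µ⁻¹) scaling in the analysis.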

Privacy analysis
We study the privacy of the algorithm (6)–(8) in terms of differential privacy. We focus on graph-homomorphic perturbations and show that the adopted scheme is differentially private. To do so, we first define what it means for an algorithm to be ǫ-differentially private. Without loss of generality, assume agent 1 in federated unit 1 decides not to participate, and its data samples x_{1,1} are replaced by a new set x′_{1,1} with a different distribution. Then, with the new data, the algorithm takes a different path. We denote the new models by w′_{p,k,i}. The idea behind differential privacy is that an outside observer should not be able to distinguish between the two trajectories w_{p,k,i} and w′_{p,k,i}, and thus cannot conclude whether agent 1 participated in the training. More formally, differential privacy is defined below.

Definition 1 (ǫ(i)-Differential privacy) We say that the algorithm given in (6)–(8) is ǫ(i)-differentially private for server p at time i if the corresponding condition holds on the joint distribution f(·) of the observed trajectories. Thus, the definition states that minimally varied trajectories have comparable probabilities. In addition, the smaller the value of ǫ, the higher the privacy guarantee. Thus, the goal will be to decrease ǫ as long as the model utility is not strongly affected.
Next, in order to show that the algorithm is differentially private, we require the sensitivity of the algorithm to be bounded. The sensitivity Δ(i) at time i measures the distance between the original and perturbed weight trajectories. It is shown in "Appendix 4" that Δ(i) can be bounded in terms of constants B and B′ chosen by the designer, and that this bound holds with high probability. This probability depends on the values chosen for B and B′: larger values for these constants increase the probability, but nevertheless lead to a looser bound for privacy (as shown in Theorem 3). Therefore, the choice of B and B′ needs to be balanced judiciously to ensure the desired level of privacy. Using the bound on the sensitivity and the definition of differential privacy, we can finally show that the algorithm is differentially private with high probability.
Theorem 3 (Privacy of GFL algorithm) If the algorithm (6)–(8) adopts graph-homomorphic perturbations, then it is ǫ(i)-differentially private with high probability at time i, for an appropriately chosen noise standard deviation.

Proof See "Appendix 5".
Thus, the above theorem suggests that if we wish the algorithm to be ǫ(i)-differentially private, then we need to choose the noise variance accordingly. The larger the variance, the more private the algorithm. However, the longer the algorithm runs, the larger the noise variance required to keep the same level of privacy. Said differently, if we fix the added noise, then as time passes, the algorithm becomes less private and more information is leaked. However, with graph-homomorphic perturbations, we can afford to increase the variance, since its effect on the MSE is constant, and thus decrease the leakage.
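The variance–privacy trade-off described above already appears in the standard scalar Laplace mechanism, where releasing a quantity of sensitivity Δ under Lap(0, b) noise is (Δ/b)-differentially private. The constants below are made up for illustration and only mirror the roles of B, B′, and µ in the sensitivity bound:

```python
# Illustrative only: scalar Laplace-mechanism accounting. The values of B,
# B_prime, and mu are hypothetical stand-ins for the constants in the text.

def laplace_epsilon(sensitivity, b):
    # Standard Laplace-mechanism guarantee: eps = Delta / b.
    return sensitivity / b

B, B_prime, mu = 1.0, 0.5, 0.01
delta_i = B + mu * B_prime        # stand-in for the bound Delta(i) <= B + O(mu)B'
b = delta_i / 0.5                  # choose the noise scale to target eps = 0.5
eps = laplace_epsilon(delta_i, b)
assert abs(eps - 0.5) < 1e-12

# A larger noise scale b yields a smaller eps, i.e., stronger privacy.
assert laplace_epsilon(delta_i, 2 * b) < eps
```

This also makes the text's observation concrete: if the sensitivity grows with i while b is fixed, ǫ(i) grows and privacy degrades, whereas increasing b restores it at no O(µ⁻¹) utility cost under graph-homomorphic perturbations.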
Moreover, we study the effect of the model drift on the privacy of the algorithm. If we examine closely the probability that the sensitivity is bounded, the model drift ξ appears in the O(µ) term. The smaller the model drift, the higher the probability that the sensitivity is bounded. This, in turn, implies that the algorithm is differentially private with higher probability. Furthermore, if we study the average ǫ(i), we see that as the model drift decreases, so does ǫ(i) on average. Therefore, with smaller model drift we can achieve higher privacy with more certainty.

Experimental analysis
We conduct a series of experiments to study the influence of privatization on the GFL algorithm. The aim of the experiments is to show the superior performance of graph-homomorphic perturbations over random perturbations, and of perturbing gradients over perturbing models, and to study the effect of different parameters on the performance of the algorithm.

Regression
We first study a regression problem on simulated data. We do so for the tractability of the problem: the quadratic loss has a closed-form solution, i.e., a formal expression for the true model w^o is known, which makes the calculation of the mean-square error feasible and more accurate. Therefore, consider a streaming feature vector u_{p,k,n} ∈ R^M with output variable d_{p,k}(n) ∈ R generated by the linear model (63), where w⋆ ∈ R^M is some generating model, and v_{p,k}(n) is a zero-mean Gaussian random variable with variance σ²_{v_{p,k}}, independent of u_{p,k,n}. The optimal model that solves the corresponding problem is then found in closed form in terms of the covariance matrix R_u and the cross-covariance vector r_{uv}. We consider P = 10 units, each with K = 100 agents. We assume N_{p,k} = 100 for each agent. We randomly generate two-dimensional feature vectors u_{p,k}(n) from a Gaussian random vector with zero mean and a randomly generated covariance matrix R_{u_{p,k}}. We then calculate the corresponding outputs according to (63). To make the data non-iid across agents, we assume the covariance matrix R_{u_{p,k}} is different for each agent, as well as the variance σ²_{v_{p,k}} of the added noise. When running the algorithm, we assume each unit samples at random L = 11 agents, and each agent runs E_{p,k} ∈ [1, 10] epochs and uses a mini-batch of B_{p,k} ∈ [5, 10] samples. We compare three algorithms: the standard GFL algorithm, the privatized GFL algorithm with random perturbations, and the privatized GFL algorithm with graph-homomorphic perturbations. We do not add noise between the clients and their server, in order to focus on the effect of the perturbations between the servers. In the first set of simulations, we fix the step-size µ = 0.7 and the regularization parameter ρ = 0.1. We fix the variance of the added noise for privatization in both schemes to σ²_g = 0.1. We then plot the mean-square deviation (MSD) of the centroid model at each time step, as seen in Fig. 2.
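The closed-form target can be checked numerically. The following sketch (with our own simplified choice of unit-variance features rather than the paper's randomly generated covariances) estimates R_u and r_{uv} from simulated data and recovers the generating model, illustrating why the quadratic setting makes the MSD computation exact:

```python
import random

random.seed(3)

# Simulated regression setup: d = u^T w_star + v, with Gaussian features and
# noise. For the unregularized quadratic risk, the minimiser is
# w^o = R_u^{-1} r_uv, which here coincides with the generating model w_star.
M, N = 2, 100000
w_star = [1.0, -0.5]
sigma_v = 0.1

# Accumulate sample moments R_u (2x2) and r_uv (2x1).
R = [[0.0, 0.0], [0.0, 0.0]]
r = [0.0, 0.0]
for _ in range(N):
    u = [random.gauss(0, 1), random.gauss(0, 1)]
    d = u[0] * w_star[0] + u[1] * w_star[1] + random.gauss(0, sigma_v)
    for a in range(M):
        r[a] += u[a] * d / N
        for b in range(M):
            R[a][b] += u[a] * u[b] / N

# Solve the 2x2 system R w = r by Cramer's rule.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
w_o = [(r[0] * R[1][1] - r[1] * R[0][1]) / det,
       (r[1] * R[0][0] - r[0] * R[1][0]) / det]

assert abs(w_o[0] - w_star[0]) < 0.05 and abs(w_o[1] - w_star[1]) < 0.05
```

With a known w^o, the MSD curves in the figures can be computed against the exact optimum instead of a numerical surrogate.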
We observe that the privatized GFL with random perturbations has lower performance compared to the other two algorithms, while using graph-homomorphic perturbations does not result in such a decay in performance. Thus, our suggested scheme does a good job of tracking the performance of the original GFL algorithm, without compromising the privacy level.
We next study the extent of the effect of the noise on the model utility. Thus, we run a series of experiments with varying added noise σ²_g = {0.001, 0.01, 0.1, 1, 2, 10} for the two privatized GFL algorithms. We plot the resulting MSD curves in Fig. 3a. We observe that, for a fixed step-size, as we increase the variance, the MSD of the algorithm with random perturbations increases significantly, as opposed to the algorithm with graph-homomorphic perturbations. Thus, we conclude that the algorithm with random perturbations is more sensitive to the variance of the added noise. In fact, beyond some variance, the algorithm with random perturbations breaks down, while using graph-homomorphic perturbations delays that effect to much larger variances. In addition, as long as the step-size is small enough, we can always control the effect of the graph-homomorphic perturbations.
However, if we were to look at the individual MSD of one federated unit, we would discover that the performance of the algorithm decays as the noise variance is increased, though not to the extent of random perturbations. We plot in Fig. 3b the average individual MSD for the varying noise variance.

Fig. 2 Performance of GFL with no perturbations (blue), with graph-homomorphic perturbations (green), and random perturbations (red)

We observe that, for a fixed noise variance, graph-homomorphic perturbations result in better performance. Furthermore, as we increase the noise variance, the network disagreement increases for both schemes. This comes as no surprise and is in accordance with Lemma 3. Furthermore, as previously mentioned, graph-homomorphic perturbations have the added value of not being negatively affected by the decrease in the step-size. In addition, even though the improvement does not seem significant, the source of the error of the two schemes is different. The information about the true model is distributed in the network and can be retrieved by running a consensus-type step at the end of the learning algorithm. At that point, the local models no longer contain information about the local data, and thus agents can safely share their models. However, when random perturbations are used, reconstruction is not possible, since the information has been lost in the network due to the added perturbations.
We next fix the noise variance σ²_g = 0.1 and vary the step-size µ = {0.1, 0.5, 1, 5}. According to Theorem 4, the MSD resulting from random perturbations includes an O(µ⁻¹) term, which is not the case when using graph-homomorphic perturbations. Thus, we expect that a decrease in the step-size will not significantly affect the privatized algorithm with graph-homomorphic perturbations, as opposed to random perturbations. Indeed, as seen in Fig. 4, as µ is increased, the final MSD increases; this is probably due to the O(µ)σ²_s term in the bound. In contrast, for significantly small or large µ, the performance of the privatized algorithm with random perturbations decreases. In addition, for both privacy schemes, we observe that the rate of convergence slows down as we decrease the step-size. Thus, there exists an optimal step-size that achieves a good compromise between fast convergence and low MSD.

Privatized federated learning
We focus on the single-server FL setting (i.e., P = 1), where we assume we have K = 1000 agents, of which we choose L = 30 at a time. We generate non-iid datasets of varying size for each agent, as in the previous section. We allow each agent to run a varying number of epochs E_k ∈ [1, 10] during an iteration of the algorithm. We set the step-size µ = 0.2, ρ = 0.007, and σ²_g = 0.02. We compare three algorithms: the standard FL algorithm, the privatized FL algorithm with sharing of models, and the privatized FL algorithm with sharing of updates. We plot the average MSD curves after repeating the experiment 100 times. As expected, the effect of the added noise is worse when models are shared (yellow curve in Fig. 5) than when updates are shared (red curve in Fig. 5).
We next study the effect of the step-size on the MSD of the FL algorithm. We expect that as µ is increased, the MSD increases for the FL algorithm when updates are shared. When models are shared, since the gradient noise variance is tuned by µ and the added noise variance by µ⁻¹, we expect to observe a trade-off: on one hand, as µ is increased, the effect of the gradient noise grows while that of the added noise diminishes; on the other hand, as µ is decreased, the effect of the added noise overpowers that of the gradient noise. Indeed, we observe this phenomenon in (a) and (b) of Fig. 6.
Finally, we study the effect of the variance of the added noise. We fix the step-size at µ = 0.2 and vary the noise variance σ²_g ∈ {0.01, 0.05, 0.1, 0.5}. In both cases, performance degrades as we increase σ²_g ((c), (d) of Fig. 6). However, larger values of the added-noise variance affect the perturbed models more than the perturbed gradients: the algorithm diverges at smaller values of σ²_g when models are shared than when gradients are shared. Thus, sharing updates can handle larger values of σ²_g before the algorithm diverges. In addition, since the noise variance is tuned by the step-size, we can always find a suitable µ to decrease its effect.

Classification
We now focus on a classification problem applied to a dataset on click-through-rate prediction of ads. We consider the Avazu click-through dataset [29]. We split the 5101 data samples unequally among a total of 50 agents. We assume there are P = 5 units, each with K = 10 agents. We add non-iid noise to the data at each agent to change their distributions. We again compare three algorithms: standard GFL, privatized GFL with graph-homomorphic perturbations, and privatized GFL with random perturbations. We use a regularized logistic risk with regularization parameter ρ = 0.03 and set the step-size µ = 0.5. We repeat the algorithms for multiple levels of privacy, and then settle on a noise variance σ²_g = 0.6 for which the privatized algorithm with random perturbations still converges. We plot in Fig. 7 the testing error on a set of 256 clean samples that were not perturbed with noise to change their distributions. We use the centroid model learned at each iteration to compute the corresponding testing error. We observe that the graph-homomorphic perturbations do not hinder the performance of the privatized model, whereas random perturbations significantly reduce the utility of the learned model.
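One way to see why graph-homomorphic perturbations preserve utility at the centroid: if the noises added across the network cancel out, the average model is untouched even though every individual model is perturbed. The sketch below uses a simple zero-sum construction for illustration; it is an assumed scheme, not necessarily the exact perturbation design analyzed in this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
P, d, sigma2_g = 5, 3, 0.6
models = rng.standard_normal((P, d))            # one model per federated unit

raw = np.sqrt(sigma2_g) * rng.standard_normal((P, d))
homomorphic = raw - raw.mean(axis=0)            # zero-sum across the graph

perturbed = models + homomorphic
# each unit's model is individually perturbed (privacy) ...
assert not np.allclose(perturbed, models)
# ... but the centroid is exactly preserved (utility)
assert np.allclose(perturbed.mean(axis=0), models.mean(axis=0))
```

This is why the green curve in Fig. 7 tracks the unperturbed baseline while independent random perturbations, which do not cancel, degrade the learned model.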

Conclusion
In this work, we introduced graph federated learning and implemented an algorithm that guarantees privacy of the data in a differential-privacy sense. We showed that general privatization schemes based on adding random perturbations to updates in federated learning have a negative effect on the performance of the algorithm: random perturbations drive the algorithm farther away from the true optimal model. However, we showed that by adding graph-homomorphic perturbations, which exploit the graph structure, performance can be recovered with guaranteed privacy. We also showed that using dependent perturbations does not incur the usual trade-off between privacy and efficiency. In federated learning, we proved that sharing perturbed gradients instead of perturbed models significantly reduces the effect of the added noise on model utility. Thus, we no longer have to choose what to prioritize; instead, we can have both a highly privatized algorithm and good model utility.

Appendix 1: Secondary result on individual MSE performance
We first introduce the following theorem, which will be used to bound the network disagreement. We loosely bound the individual MSE for each federated unit; a tighter bound can be derived, but it is not needed here.
Theorem 4 (Individual MSE convergence) Under Assumptions 1, 2 and 3, the individual models converge to the optimal model w^o exponentially fast for a sufficiently small step-size: where ⪯ denotes elementwise comparison, Γ is a diagonal matrix whose p-th diagonal entry is γ_p = 1 − 2νµ + δ²µ² + β²_{s,p}µ² + O(µ²) ∈ (0, 1), σ²_{q,p} is the average of σ²_{q,p,k}, and σ²_{g,p} is the total variance introduced by the noise added at server p. Then, taking the limit as i goes to infinity: (70)

Fig. 7 Testing error of GFL with no perturbations (blue), with graph-homomorphic perturbations (green), and random perturbations (red)

Proof Focusing on the error at a single server p, we can verify that: where we define σ²_{q,m} to be the average of σ²_{q,m,k}.
Step (a) follows from the independence of the random variables and the zero mean of the gradient noise and the added noise, (b) from Jensen's inequality and the bounds on the gradient noise (24) and the incremental noise (37), and (c) from ν-strong convexity and δ-Lipschitz continuity. Then, choosing α = 1 − 2νµ + δ²µ² = 1 − O(µ) and taking the expectation over the filtration, we get: (71) where we introduce the constants γ_m and σ²_{g,m}: Next, stacking the local mean-square errors into a column vector (dropping the indexing from the column vectors), we get the following bound: where Γ is the diagonal matrix with the γ_p as its entries. Then, choosing µ small enough that γ_p < 1 for every p, the limit of Γ^i as i goes to infinity is zero. Furthermore, since the eigenvalues of Γ are less than 1, the geometric series converges to (I − Γ)⁻¹. Thus, we get the desired result.

Appendix 2: Proof of Lemma 3
Consider the aggregate model vector W_i = col{w_{p,i}}_{p=1}^P, for which we write the model recursion as: where G_i is a matrix whose entries are the noises g_{pm,i}, and the diag(·) function extracts the diagonal entries of a matrix and stacks them into a column vector.
Since A is doubly stochastic, it admits an eigendecomposition of the form A = QHQ^T, with the first eigenvalue equal to 1 and its corresponding eigenvector equal to 1/√P.
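For a symmetric doubly-stochastic combination matrix this decomposition is easy to verify numerically. The sketch below builds Metropolis-style weights on a ring graph (an assumed example topology, not the paper's experimental network) and checks the stated spectral properties.

```python
import numpy as np

P = 5
# Metropolis-style weights on a ring graph: symmetric and doubly stochastic
A = np.zeros((P, P))
for p in range(P):
    A[p, p] = A[p, (p - 1) % P] = A[p, (p + 1) % P] = 1.0 / 3

assert np.allclose(A.sum(axis=0), 1.0) and np.allclose(A.sum(axis=1), 1.0)

vals, vecs = np.linalg.eigh(A)            # A = Q H Q^T with orthonormal Q
assert np.isclose(vals[-1], 1.0)          # largest eigenvalue equals 1
assert np.allclose(np.abs(vecs[:, -1]), 1.0 / np.sqrt(P))  # eigenvector 1/sqrt(P)
assert vals[-2] < 1.0                     # second eigenvalue (iota_2) below 1
```

The gap between 1 and the second eigenvalue ι₂ is what drives the geometric decay of the network disagreement in the proof below.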
Next, we define the extended centroid model W_{c,i} = ((1/P)11^T ⊗ I)W_i, and write: (74) Then, taking the conditional expectation of ‖(Q_ε ⊗ I)W_i‖² given the past models, we can split the gradient noise and the added privacy noise from the model and the true gradient. Taking again the expectation over the past data, and then using the sub-multiplicativity of the norm followed by Jensen's inequality, we have: Next, we focus on each individual term. Using Jensen's inequality with some constant α, and then the Lipschitz condition and the bound on the incremental noise, we can bound the norm below as follows: Using the bound on the gradient noise (24), we get another E‖w̃_{p,i−1}‖² term, which can be bounded by the result in Theorem 4. Thus, we write: (77) The noise term can be written more compactly as ‖Q_ε ⊗ I‖² Σ_{p=1}^P σ²_{g,p}. Thus, putting everything together, we get: Going back to the network disagreement, it is bounded by the above bound multiplied by ‖Q_ε^T ⊗ I‖²/P. Driving i to infinity, since ‖H_ε‖ = ι₂ < 1, with ι₂ being the second eigenvalue of A, and choosing α = ι₂, we obtain: (80)

Appendix 3: Proof of Theorem 1
First, taking the conditional mean of the ℓ₂-norm of the centroid error given the past models splits the mean into three independent terms: the centralized recursion, the gradient noise, and the added noise. Then, taking the expectation again, we get: (82) Finally, using the result on the network disagreement, recursively bounding the error, and taking the limit over i, we get the final result: To show that the sensitivity of the algorithm is bounded with high probability, we require bounds on E‖W̃_i‖² and E‖W̃'_i‖². From Theorem 4 we can bound the individual errors by: where γ_max = max_p γ_p. Then, E‖W̃_i‖² can be bounded as follows: It follows that, for some constants B and B', the probability that ‖W̃_i‖ and ‖W̃'_i‖ are unbounded can be bounded using Markov's inequality by: and similarly for ‖W̃'_i‖. (88) We have already shown in previous work that the gradient noise is zero-mean and has bounded second-order moment [28, Lemma 1], while the incremental noise has bounded second-order moment [28, Lemma 5]: where the constants β²_{s,1}, σ²_{s,1}, σ²_{q,1} are given by: Taking the conditional mean of the ℓ₂-norm of the error, we can split the noise term from the rest and then apply Jensen's inequality with some constant α ∈ (0, 1), using the incremental term ∇_{w^T} J_{1,k}(w_{1,k,e−1}) − ∇_{w^T} J_{1,k}(w_{1,i−1}) (equations (99)–(102)): lim sup_{i→∞} E‖w̃_{1,i}‖² ≤ O(µ)(σ²_{s,1} + ξ² + σ²_{g,1}) + O(µ²)σ²_{q,1}.

Fig. 3 Performance curves of privatized GFL with varying noise variance

Fig. 6 MSD plots of privatized FL with varying step-size and variance of added noise

Appendix 4: Secondary result involving a bound on E‖W_i‖²