Achieve data privacy and clustering accuracy simultaneously through quantized data recovery

This paper develops a data collection and processing framework that achieves individual users' data privacy and the operator's information accuracy simultaneously. Data privacy is enhanced by adding noise and applying quantization to the data before transmission, and the privacy of an individual user is measured by information-theoretic analysis. This paper develops a data recovery and clustering method for the operator to extract features from the privacy-preserving, partially corrupted, and partially observed measurements of a large number of users. To prevent cyber intruders from accessing the data of many users, it also develops a decentralized algorithm such that multiple data owners can collaboratively recover and cluster the data without sharing the raw measurements directly. The recovery accuracy is characterized analytically and shown to be close to the fundamental limit of any recovery method. The proposed algorithm is proved to converge to a critical point from any initial point. The method is evaluated on recorded Irish smart meter data and UMass smart microgrid data.

(2020) 2020:22 Page 2 of 36
Then, applying NILM to these noisy and quantized measurements, an intruder can no longer accurately identify the patterns of individual appliances and, in turn, the user behavior. The increase in user privacy is achieved, however, at the cost of data distortion and reduced data accuracy for the operating center [10][11][12]. Although the operating center does not need high-time-resolution information about every individual appliance in each household, it still requires accurate estimates of the aggregated power consumption and the common load patterns among households for forecasting, demand response, and planning. For example, the center clusters customers with similar load patterns and then employs the load pattern of each cluster to enhance the load forecasting accuracy [13] and to determine the incentives for demand response [14,15]. If noise and quantization are added to the data to enhance privacy, the information accuracy for the operator is effectively reduced. This paper shows that the data privacy of each individual user can be protected and, at the same time, the information accuracy at the operating center about user power consumption and the major patterns among different users can be maintained. To the best of our knowledge, this is the first work that achieves data privacy and information accuracy simultaneously. In our proposed framework, each user's actual power consumption is masked by first adding noise to the measurements and then quantizing the output to one of a few levels. The privacy of an individual user can be enhanced in this way from an information-theoretic perspective [16][17][18][19]. Once the data is quantized, the variation information is blurred, and hence NILM methods fail to identify individual appliances.
Although adding noise and quantization have been employed before to enhance privacy (e.g., [6,20]), this paper shows, for the first time, that such privacy enhancement does not necessarily lead to a reduction in information accuracy. The central technical contribution of this paper is a data recovery and clustering method that remains accurate even when the measurements are highly noisy and quantized, contain significant errors, and are partially lost. Our method is proved to provide accurate data recovery and clustering results as long as the center has measurements from a sufficiently large number of users. In contrast, a cyber intruder with access to the measurements of a small number of users cannot obtain accurate information even with the same approach. We develop a decentralized algorithm that allows multiple data owners to cooperatively recover and cluster the data without directly sharing their own raw measurements, which makes it extremely difficult for an intruder to access large amounts of data. Thus, the data privacy of an individual user is enhanced while the information accuracy for the operating center is maintained.
Since load profiles with similar load patterns can be represented by data points in a low-dimensional subspace of the high-dimensional ambient space, all the load profiles can be characterized by the Union of Subspaces (UoS) model [21], and the load clustering problem can be formulated as a subspace clustering problem. Various subspace clustering techniques have been developed, see, e.g., [21][22][23][24][25][26]. None of these approaches, however, considers the case in which the measurements are highly quantized. To the best of our knowledge, only one recent work considered subspace clustering and data recovery from highly noisy and quantized data [27]. This paper follows the mathematical setup of [27] but extends it significantly in the following aspects. Ref. [27] does not consider data privacy, while this paper proposes a data collection framework to achieve data privacy and information accuracy simultaneously. We characterize the data privacy through mutual information; such analysis does not exist in [27]. Ref. [27] assumes that all the measurements are available at the center, while this paper considers the more general setup in which some measurements are lost during transmission and never arrive at the center. This paper analytically characterizes the data recovery error of our proposed method as a function of the data loss percentage. Moreover, this paper characterizes the fundamental limit of the recovery error of any possible recovery method and shows that our method is nearly optimal in reducing the recovery error. None of these fundamental analyses exist in [27]. Furthermore, only a centralized algorithm, Sparse-APA, is discussed in [27]. This paper develops a Distributed Sparse Alternative Proximal Algorithm (DSAPA) for multiple data owners to collaboratively solve the subspace clustering and data recovery problem without sharing the measurements with each other. Thus, the user data privacy can be further protected.
This paper is also related to the quantized matrix recovery problem [28][29][30][31][32][33][34][35][36], in which the data matrix is assumed to be low rank. The low-rank matrix model is a special case of the UoS model by restricting to one subspace only. In fact, the data matrix of the load profiles can be high rank or even full rank in our setup. Finally, we remark that this paper considers smart meter measurements that measure the aggregated energy consumption in a house, and does not consider applying NILM on the operator side. Distributed smart metering can provide energy consumption of individual electrical appliances in a house [20].
The rest of the paper is organized as follows. Section 2 introduces our proposed framework, the problem formulation, related works, and the data privacy enhancement analysis. The theoretical analyses of our recovery and clustering method are presented in Section 3. Section 4 introduces the details of the DSAPA with its convergence guarantee. Section 5 records the numerical experiments of our method on real smart meter datasets. Section 6 concludes the paper. All the proofs are deferred to Appendices 1-7.
2 Our proposed framework of privacy-preserving data collection and information recovery

2.1 Our framework and problem formulation

Figure 1 visualizes our proposed framework of privacy-preserving smart meter data collection and information recovery. To enhance the user data privacy, the actual power consumption is mapped to a few fixed power levels at the output of the smart meter. One can achieve this through signal processing in the smart meter or by connecting a rechargeable battery to each household. Thus, the actual consumption is masked in the noisy and quantized smart meter measurements. As shown in Fig. 1, the measurements are collected by W agents disjointly, and the agents do not share measurements directly. The agents recover the data and cluster the users with similar consumption patterns collaboratively in a distributed fashion. When W = 1, the setup reduces to the case of a single center. We defer the discussion of user privacy enhancement through the proposed framework to Section 2.3. We first define the recovery and clustering problem from quantized data mathematically as follows. L* ∈ R^{m×n} denotes the actual power usages of n users, with each column containing the power usage of one user at m time instants. We assume that users with similar consumption patterns belong to the same group and that there are p groups in total. The columns of the same group belong to a d-dimensional subspace. Let S_i (i ∈ [p]) denote the ith subspace; these p subspaces are distinct. Let r denote the rank of L*; then r ≤ pd. Let L*_i denote the submatrix of L* that contains the points in S_i, and let n_i denote the number of columns in L*_i, i.e., the number of users in group i. We assume m ≤ n_i ≤ ξn/p for all i and some positive constant ξ. We further assume m = n/(κp) for some positive constant κ to simplify the representation of the main results.
There exists a coefficient matrix C* ∈ R^{n×n} such that L* = L*C* and C*_{i,i} = 0 for all i ∈ [n]. Moreover, C*_{i,j} is zero if the ith and jth columns of L* do not belong to the same subspace [21]. These two properties, the self-expressive property and the subspace-preserving property, have been exploited in the literature of subspace clustering and are summarized in Definition 1.
Definition 1 [27] A matrix L ∈ R^{m×n} has the self-expressive property if L = LC for some C ∈ R^{n×n} with C_{i,i} = 0 for all i ∈ [n]. Moreover, C has the subspace-preserving property of L if C_{i,j} = 0 whenever columns i and j of L belong to different subspaces. Let the matrix E* ∈ R^{m×n} denote the additive errors in the measurements. We assume the number of nonzeros s in E* is much smaller than mn. The partially corrupted measurements can be represented by X* = L* + E*. We assume the energy consumption and the errors are bounded, i.e., ‖L*‖_∞ ≤ α_1 and ‖E*‖_∞ ≤ α_2 for some positive constants α_1, α_2, where the infinity norm ‖·‖_∞ measures the maximum absolute value of the entries.
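To make the self-expressive property concrete, the following numpy sketch builds a toy matrix whose columns all lie in one 2-dimensional subspace and solves for a zero-diagonal coefficient matrix C column by column via least squares (the dimensions and the least-squares solver are illustrative choices, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n load profiles lying in one d-dimensional subspace of R^m.
m, n, d = 20, 8, 2
basis = rng.standard_normal((m, d))
L = basis @ rng.standard_normal((d, n))          # columns lie in span(basis)

# Self-expressive property: each column is a combination of the others,
# so L = LC holds for some C with a zero diagonal.
C = np.zeros((n, n))
for j in range(n):
    others = np.delete(np.arange(n), j)          # exclude column j (C_jj = 0)
    coef, *_ = np.linalg.lstsq(L[:, others], L[:, j], rcond=None)
    C[others, j] = coef

residual = np.linalg.norm(L - L @ C)
print(residual < 1e-8)                           # columns reproduce each other
```

Because each column sits in a subspace spanned by the remaining columns, the residual is numerically zero, which is exactly the property Definition 1 formalizes.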
The quantization process in each household is modeled as follows. The measured energy consumption at each time step is mapped to one of K values in a probabilistic fashion. Figure 2 shows the quantization process. It can be modeled as adding random noise first and then quantizing to K levels. The noise matrix N ∈ R^{m×n} is independent of X*. Entries of N are i.i.d. generated from a fixed cumulative distribution function (c.d.f.) Φ(x). The quantization boundaries ω_0 < ω_1 < ... < ω_{l−1} < ω_l < ... < ω_K and the quantized value Q_l, l ∈ [K], for the bin [ω_{l−1}, ω_l) are given. Then, the probability of mapping X*_{i,j} to Q_l is φ_l(X*_{i,j}) = Φ(ω_l − X*_{i,j}) − Φ(ω_{l−1} − X*_{i,j}), and Σ_{l=1}^{K} φ_l(X*_{i,j}) = 1. The noise N is introduced to hide the user information. One choice of Φ(x) is the probit model with Φ(x) = Φ_norm(x/σ), where Φ_norm is the c.d.f. of the standard Gaussian distribution N(0, 1), and σ > 0 is the standard deviation. The quantized measurements Y are sent to the center. Data losses can happen during the communication, visualized by the question marks in Fig. 2. Let the set Ω denote the indices of measurements that are not lost. In the general case that the measurements are collected by W agents/nodes separately, we assume for simplicity that each node collects the data from q = n/W users. Node 1 collects the data from the first q users, node 2 collects the next q users, and so on. Node i thus holds the columns with indices {q(i−1)+1, q(i−1)+2, ..., qi}; let L*_i denote the corresponding submatrix of L*, so that L* can also be written as [L*_1, L*_2, ..., L*_W]. Similarly, E* = [E*_1, E*_2, ..., E*_W], and node i collects Y_i. The data recovery and pattern extraction problem for one center can be stated as follows. (P1) Given the quantized measurements Y, known boundaries ω_0 < ω_1 < ... < ω_K, and noise distribution Φ, can we recover the real power usages L* and cluster the users through estimating C* simultaneously?
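The quantization model can be simulated directly. The sketch below uses assumed boundaries and noise parameters (K = 3, boundaries 0.5 and 1.5, probit noise with σ = 0.5) and checks both that the mapping probabilities φ_l sum to one and that the "add noise, then quantize" view matches them empirically:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(1)

def Phi(x, sigma=0.5):
    """Probit noise c.d.f.: Phi(x) = Phi_norm(x / sigma)."""
    return 0.5 * (1.0 + erf(x / (sigma * np.sqrt(2.0))))

# Assumed boundaries for K = 3 quantization levels.
omega = np.array([-np.inf, 0.5, 1.5, np.inf])    # omega_0 < omega_1 < ... < omega_K
x = 1.2                                          # one true entry X*_{i,j}

# phi_l(x) = Phi(omega_l - x) - Phi(omega_{l-1} - x), l = 1, ..., K
phi = np.array([Phi(omega[l] - x) - Phi(omega[l - 1] - x) for l in range(1, 4)])
print(np.isclose(phi.sum(), 1.0))                # the K probabilities sum to one

# Equivalent generative view: add noise first, then quantize deterministically.
noisy = x + rng.normal(0.0, 0.5, size=100000)
bins = np.searchsorted(omega[1:-1], noisy)       # bin index 0 .. K-1
empirical = np.bincount(bins, minlength=3) / bins.size
print(np.allclose(empirical, phi, atol=0.01))    # matches phi_l up to sampling error
```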
Moreover, if the measurements Y_{i,j} are not shared among the W nodes to protect user privacy, (P2) can we estimate L* and C* with W nodes in a decentralized fashion? Some notations used in this paper are summarized in Table 1.

Related work
When p = 1, i.e., all the users share the same pattern, L* is approximately a low-rank matrix. Then, (P1) reduces to the problem of low-rank matrix recovery from quantized measurements [28][29][30][31][32][33][34][35][36][37], with motivating applications in image processing [38], collaborative filtering [31], and sensor networks [39]. Since there is only one subspace in this case, these works do not consider data clustering and focus on data recovery only. When the quantization process does not exist, (P1) reduces to the conventional subspace clustering problem [21][22][23][24][25][26][40]. If the subspace-preserving C* is estimated, one can apply spectral clustering [41] to obtain the clustering of the data points. For example, Sparse Subspace Clustering (SSC) [21] is a common choice for subspace clustering; SSC estimates C* by solving a convex optimization problem. Other clustering methods cluster data points based on the Euclidean distance. For instance, Lin et al. [42] and Keogh et al. [43] leverage a linear combination of box basis functions to approximate the original data while retaining the features of interest.
Reference [27] is the first paper that studies the subspace clustering from quantized measurements when p ≥ 1. Wang et al. [27] do not consider missing data and develop a centralized data recovery method from full observations. This paper follows the same problem formulation as [27] and extends to the general case of partial observations. We provide both the recovery guarantee of our approach and the fundamental limit of the recovery accuracy by any method. Moreover, a framework of privacy-preserving smart meter data collection is proposed in this paper, and we further enhance the data privacy by developing a decentralized data recovery method.
Our problem formulation and methods apply to other domains such as image and video processing and phasor measurement unit (PMU) data analytics for power systems. In image recovery and image clustering [27], images of the same person under varying illumination belong to the same low-dimensional subspace [44]. Columns of L* correspond to images of multiple people, and the goal is to enhance the image quality and cluster the data using low-resolution images. Similarly, in motion segmentation, each column of L* represents the trajectory of a reference point, and the reference points on the same rigid object belong to the same subspace; motion segmentation then becomes a subspace clustering problem on the observed measurements. In PMU data analytics, the time series of PMUs affected by the same event belong to the same subspace [32,45], so the event location problem can be solved by subspace clustering.

Data privacy enhancement in the proposed framework
Various methods have been developed to enhance the privacy of power consumption data. For example, one can use pre-processing techniques like temporal averaging, adding noise, and quantization [4,5,20] to alter the data. However, directly altering data might affect the accuracy of some applications, e.g., billing and profiling [46]. Alternatively, rechargeable batteries and PV converters can be leveraged to mask the actual power consumption [6,7,47]. The noise addition and quantization process in this paper can be achieved by either signal processing or rechargeable batteries. In general, privacy guarantees can be achieved through either computational hardness [48][49][50] or information-theoretic analysis [16][17][18][19]. The existing analytical results on data privacy only work for specific or simple models and do not generalize easily. For instance, under the setup of communication between two nodes, ref. [17] analyzes the trade-off between data sharing and privacy. Under the assumptions of an i.i.d. input load sequence and an i.i.d. energy harvesting process, the minimum information leakage rate is provided for a certain energy management policy in [51]. Some other methods analyze data privacy numerically. In [52], the information leakage rate is measured by the relative entropy of the probability measures of the original load data and the modified load data and is calculated by a Monte Carlo method. Refs. [7] and [12] measure the information leakage through the mutual information between the original load data and the modified load data. Following the existing work on smart meter data privacy, see, e.g., [19,52,53,54], this paper analyzes the data privacy from an information-theoretic perspective. The data privacy of an individual user is analyzed by comparing the original data and the data after privacy enhancement through quantities like the Kullback-Leibler (KL) divergence [52], mutual information, and normalized mutual information [18].
In our framework, the actual energy consumption of user i, denoted by L*_i, is masked by additive Gaussian noise and quantization, resulting in Y_i. Let P_{L*_i} and P_{Y_i} denote the probability distributions of L*_i and Y_i, respectively. The privacy can be measured through the normalized mutual information (NI) between L*_i and Y_i [18], defined as

NI(L*_i, Y_i) = I(L*_i; Y_i) / H(L*_i) = [ Σ_{x∈X} Σ_{y∈Y} P_{(L*_i,Y_i)}(x, y) log( P_{(L*_i,Y_i)}(x, y) / (P_{L*_i}(x) P_{Y_i}(y)) ) ] / [ −Σ_{x∈X} P_{L*_i}(x) log P_{L*_i}(x) ],   (2)

where the spaces X and Y are the feasible sets of L*_i and Y_i, respectively, P_{(L*_i,Y_i)} is the joint distribution of L*_i and Y_i, and P_{L*_i} and P_{Y_i} are the marginal distributions of L*_i and Y_i, respectively.
The numerator of (2) is the mutual information between L*_i and Y_i, and the denominator is the entropy of L*_i. When L*_i and Y_i are independent of each other, NI(L*_i, Y_i) reaches its minimum value 0. When Y_i is exactly the same as L*_i, NI(L*_i, Y_i) equals the maximum value 1. A smaller NI corresponds to a higher level of data privacy of L*_i and indicates a more significant difference between L*_i and Y_i. Rigorously speaking, L*_i belongs to a continuous space; however, since all measuring devices have a finite resolution, L*_i can be viewed as a discrete random variable. When computing NI in practice, one can divide the range of the values into small regions to compute the sample probability distributions.
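A sample-based NI computation along these lines can be sketched as follows; the binning, noise level, and 2-level quantizer below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def normalized_mutual_info(x, y, bins=16):
    """Empirical NI(x, y) = I(x; y) / H(x); the value ranges are divided
    into small regions (bins) to form sample probability distributions."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    return mi / h_x

# Hypothetical load profile, masked by Gaussian noise and 2-level quantization.
load = np.abs(rng.standard_normal(20000))
masked = (load + rng.normal(0.0, 0.5, load.size) > 1.0).astype(float)

print(normalized_mutual_info(load, load) > 0.99)   # Y_i = L*_i: NI at its maximum 1
print(normalized_mutual_info(load, masked) < 0.5)  # masking drives NI toward 0
```

Comparing the two printed values illustrates the trade-off in the text: the noisy, coarsely quantized signal retains far less information about the original load than the load itself.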
The above information-theoretic measures show that when the data of individual users are processed separately, a user's data privacy is enhanced at the cost of reduced information accuracy. We need to emphasize that measures like NI or KL divergence focus on an individual signal and do not characterize the information recovery when multiple signals are processed together. In fact, when the data of multiple users are available, and strong correlations exist among different users' data, such correlation can be leveraged to enhance the data accuracy. As stated in problems (P1) and (P2), the major technical objective of this paper is to develop data recovery and clustering methods from quantized measurements of multiple users, where the data correlations are characterized by data points belonging to the same subspace. As we will show in Section 3 (Theorem 1 and Proposition 1), asymptotic information accuracy from quantized measurements can be achieved when the number of users increases to infinity. We need to emphasize that this result does not contradict the data privacy enhancement by adding noise and applying quantization. This is because the asymptotic information accuracy is only achieved when processing the correlated data of a large number of users, while a cyber intruder is very unlikely to have access to the data of so many users. In our proposed decentralized data collection and processing framework (Fig. 1), each agent collects the measurements of a subset of users, and the measurements are not directly shared among the agents. A cyber intruder needs to hack either all these agents or the smart meters of all users to be able to access all the data. Since such an attack is very unlikely to happen, the user's data privacy is still protected. Privacy from the recovery perspective will be discussed in detail in Section 3.3.

Results: theoretical
Here, we consider solving (P1) at a single center and defer the discussion of solving (P2) in a decentralized way through distributed nodes to Section 4. We propose to estimate L*, C*, and E* by the solution (L̂, Ê, Ĉ) to the optimization problem (3)-(5), where 1_[A] is the indicator function that takes value 1 if A is true and value 0 otherwise, and ‖·‖_0 measures the number of nonzero entries in a vector or matrix. Data recovery and subspace clustering are achieved simultaneously by solving (3)-(5), a constrained maximum log-likelihood estimation problem that maximizes the likelihood of obtaining Y when the underlying data matrix is L̂ and the error matrix is Ê. The formulation follows (8) of [27], extended from full observations to the partial observations in Ω. After obtaining Ĉ, spectral clustering [41] is applied to Ĉ to obtain group labels.
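The log-likelihood being maximized can be sketched for the probit model as follows. The σ = 1 noise, K = 3 bins, and rank-1 toy data are assumptions for illustration; the sketch only evaluates the objective on the observed entries, it does not solve (3)-(5):

```python
import numpy as np
from math import erf

# Elementwise probit c.d.f. with sigma = 1 (an assumed noise model).
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0))))

def neg_log_likelihood(L, E, Y, omega, observed):
    """Negative log-likelihood of quantized observations Y (bin indices 0..K-1)
    under candidate data L and error matrix E, restricted to observed entries."""
    X = L + E
    prob = Phi(omega[Y + 1] - X) - Phi(omega[Y] - X)   # phi_{Y_ij}(X_ij)
    return -np.sum(np.log(prob[observed]))

rng = np.random.default_rng(3)
m, n = 100, 40
L_true = np.outer(rng.standard_normal(m), rng.standard_normal(n))  # rank-1 toy data
omega = np.array([-np.inf, -0.5, 0.5, np.inf])                     # K = 3 bins
Y = np.searchsorted(omega[1:-1], L_true + rng.standard_normal((m, n)))
observed = rng.random((m, n)) < 0.8        # ~20% of entries lost in transit

nll_true = neg_log_likelihood(L_true, np.zeros((m, n)), Y, omega, observed)
nll_zero = neg_log_likelihood(np.zeros((m, n)), np.zeros((m, n)), Y, omega, observed)
print(nll_true < nll_zero)   # the true matrix explains Y better than a blind guess
```

Maximizing the likelihood over a structured feasible set, as (3)-(5) do, amounts to searching for the (L, E) pair that drives this objective down.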
Problem (3)-(5) is nonconvex due to the nonconvexity of the feasible set S_f in (5). We first analyze the recovery and clustering performance, assuming that a solution exists, and defer the algorithm to Section 4.

Data recovery guarantee
Two constants, γ_α and L_α, are needed for the recovery analysis; they are defined through φ'_l(x) and φ''_l(x), the first- and second-order derivatives of φ_l with respect to x. One can check that φ_l is strictly log-concave if Φ is log-concave, which holds true for the Gaussian and logistic distributions [28]. L_α and γ_α are bounded by fixed constants when α_1, α_2, and φ_l are given.
Since the data recovery performance and the clustering performance are coupled together, we first analyze the recovery performance, assuming that the clustering results are not "arbitrarily bad." We follow the same assumption as [27], which essentially requires that, in the estimated clustering results, every cluster contains data points belonging to at most a constant number of the p original subspaces. Formally, we have Assumption 1 [27]: Columns of L̂ belong to p subspaces, each of which has a dimension smaller than or equal to d. Columns of L̂ with indices corresponding to columns of L* in S_i (i ∈ [p]) belong to at most (g − 1) subspaces, where g is a constant larger than 1.
We follow the assumption in [28] about the locations of the observed entries, with a minor change to handle multiple subspaces instead of the single subspace in [28]. Define a binary matrix G ∈ {0, 1}^{m×n} with G_{i,j} = 1 if (i, j) ∈ Ω and G_{i,j} = 0 otherwise, and let G_i contain the columns of G corresponding to S_i. Assumption 2 Each column of G_i has h nonzero entries. Let σ_1(G_i) and σ_2(G_i) denote the largest and the second largest singular values of G_i, respectively; σ_2(G_i) is assumed to be sufficiently small relative to σ_1(G_i). Assumption 2 is similar to the sampling assumption in [28]. The difference is that we make the assumption on the columns belonging to each subspace instead of on the whole matrix. Assumption 2 is a generalization of uniform sampling and includes uniform sampling as a special case. Theorem 1, which holds with probability at least 1 − pC_1e^{−C_2ξn/p} for some positive constants C_1 and C_2, characterizes the recovery error from partially observed, partially corrupted, and quantized measurements. It can be interpreted from the following aspects.
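As a sanity check on this type of sampling assumption, the sketch below draws a uniform-sampling mask with h observed entries per column and verifies that its second singular value is much smaller than its first; the exact inequality required by Assumption 2 is not reproduced here, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Uniform-sampling observation mask for one subspace: each of the n_i columns
# has exactly h of its m entries observed (G_{j,k} = 1 means observed).
m, n_i, h = 60, 200, 30
G = np.zeros((m, n_i))
for k in range(n_i):
    G[rng.choice(m, size=h, replace=False), k] = 1.0

sv = np.linalg.svd(G, compute_uv=False)
print((G.sum(axis=0) == h).all())      # h nonzero entries per column
print(sv[0] / sv[1] > 3)               # sigma_1 dominates sigma_2
```

Under uniform sampling, G_i concentrates around a rank-one matrix, which is why σ_2(G_i) stays small compared with σ_1(G_i).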
(1) Correction of corrupted measurements. We first fix the data loss rate f and consider the recovery performance with corrupted measurements. Suppose f is a constant, i.e., a constant fraction of the measurements are available. Then, (8) indicates that the bound in (10) is unaffected as long as the number of corrupted measurements s is at most on the order of md²p. Thus, the recovery method tolerates a constant number of corrupted entries per column without degrading the recovery performance.
(2) Asymptotic recovery of the actual data.
Since the error bound in (10) decreases to 0 when m increases to infinity, and ‖L*‖_F is in the order of √(mn), (10) indicates that the relative error between L̂ and L* diminishes asymptotically. Moreover, as long as p is o(n), the failure probability pC_1e^{−C_2ξn/p} also decays to zero as n increases to infinity. The asymptotic recovery differentiates the operating center from cyber intruders. An operating center with a sufficient number of measurements can recover L* accurately. In contrast, a cyber intruder with access to a small number of users cannot recover the data even using the same approach (3)-(5).
(3) Tolerance of the missing data. To the best of our knowledge, only refs. [28] and [31] provide theoretical analyses of low-rank matrix recovery from quantized observations with data losses; no corruptions are considered in [28,31]. The relative recovery error bound in [28] depends on the rank r of the matrix. The relative recovery error by [31] is O(r^{1/4}/m^{1/4}) under the partial observation case. Our result in (10) indicates that when f is a constant, the error is at most O(√(d³/m)) even with corrupted measurements. Note that the rank of L* can be as large as pd when the subspaces are all orthogonal to each other. If one directly applies the approach in [37] to our setup, the relative recovery error can be as large as O(√((pd)³/m)), which is p^{3/2} times our recovery error. Thus, our approach outperforms the existing one by recovering and clustering the data simultaneously, even in the special case of no corruptions.
When there is no missing data, the recovery error by [27] is O(√(d/m)), which is slightly tighter than our error bound in (10). This is due to our techniques to handle the missing data.

Fundamental limit of any recovery method
The following theorem establishes the minimum possible error of any method, even from unquantized measurements. We consider the case that the number of corruptions is at most a constant fraction of the measurements. To simplify the analysis, we assume condition (11), where C_0 is a constant smaller than 1/2, and let S_fX denote the corresponding feasible set of data matrices. Theorem 2 Let N ∈ R^{m×n} contain i.i.d. entries from N(0, σ²). Assume (11) holds. Consider any algorithm that, for any X ∈ S_fX, takes M_{i,j} = X_{i,j} + N_{i,j}, (i, j) ∈ Ω, as the input and returns an estimate X̂ of X. Then, there always exists some X ∈ S_fX such that (13) holds with probability at least 3/4. Note that C_3 is a constant. When f is a constant, (13) indicates that the recovery error from unquantized measurements is at least of order √(d/m). Comparing with (10), one can see that our method is close to optimal. If the corrupted entries are randomly distributed, the number of corruptions among the observed entries is approximately fs. Then, the second term inside the minimization of (13) scales as (1/√f)√(d/m).

Recovery of a single user from its own data only
An intruder is often interested in the data of a certain user. If the adversary only has access to one user's data, problems (3)-(5) reduce to the single-user problem (15); since n = 1, there is no constraint on C. Problem (15) maximizes the log-likelihood of one user's data given the quantized measurements. It can be viewed as a special case of the low-rank matrix recovery from quantized measurements considered in [37]. One can check that the average recovery error is upper bounded by O(√(d³)) by setting n = 1 in Theorem 5 of [37]. Similarly, the relative recovery error of any method is at least in the order of Ω(√d) by setting n = 1 in Theorem 4 of [37]. This error bound does not depend on m, the number of measurements of this user. Therefore, if an intruder only has one user's data, even if m is very large, the average recovery error is nonzero and does not diminish as m increases. The privacy of the energy consumption behavior of this user is thus protected.

Recovery of a single user by leveraging other users in the same group
One can exploit the measurements from other users to increase the estimation accuracy for one target user. Suppose one can access n users' data over m time steps, and these users all share similar load patterns with the target user; then, from either Theorem 1 of this paper or Theorem 5 of [37], the average recovery error is at most O(√(d³/min(m, n))). Compared with the previous case of accessing the data of one single user only, the recovery error is significantly reduced. We emphasize that the decrease of the recovery error results from exploiting correlations among users.
The number of quantization levels K also affects privacy. Intuitively, a smaller K corresponds to a higher level of privacy. However, the privacy level also depends on the selection of the bin boundaries, and decreasing K does not necessarily increase privacy. For instance, if a pair of boundaries are chosen so close to each other that no measurements fall within the interval, then K = 3 could reach the same privacy and recovery error as K = 2. Therefore, K does not appear directly in Theorem 1 but rather affects the privacy indirectly through γ_α and L_α. The bin boundaries usually tend to be closer in the region where the measurements concentrate.
For smart meter data, the bin boundaries can be selected in the range of a typical household consumption level. If a certain house has electrical appliances with an energy consumption level significantly higher than normal households, this abnormal pattern of high energy consumption is in fact masked in the noisy and quantized measurements due to the way the bin boundaries are selected. However, since this house has a load pattern different from other households, one cannot exploit other users' data to enhance the recovery accuracy of this user. The recovered data of this user will have a nonzero error, as discussed in the first paragraph of Section 3.3.

Clustering guarantee
The clustering performance is evaluated through the subspace-preserving property of Ĉ. A sufficient condition for Ĉ to be subspace-preserving is stated as follows. Ref. [27] also provides a sufficient condition for Ĉ to be subspace-preserving, but the subspaces are required to be independent of each other in [27]; two independent subspaces intersect only at zero. Here, we only require the subspaces to be distinct from each other. Two subspaces are distinct if, for each subspace, there exists one point that belongs to this subspace but not to the other. The data points are generated based on some continuous distribution supported on these distinct subspaces.

Distributed sparse alternative proximal algorithm for data recovery and clustering
We next propose a distributed algorithm to solve (3) by W nodes collaboratively, such that node i can estimate L*_i from its acquired measurements Y_i without knowing Y_j or L*_j for any other node j. This further enhances user privacy. We first follow [27] and move some constraints to the objective function to simplify the algorithm design. Since the rank of L is at most r, we factorize L as L = UV^T, where U ∈ R^{m×r} and V ∈ R^{n×r}. We replace the equality constraints L = LC and L = UV^T by adding the corresponding penalty terms, weighted by λ_1 and λ_2, to the objective, leading to problem (16). The solution of (16) is the same as that of (3) when λ_1 and λ_2 approach infinity. We next decompose V into W parts, and let V_i ∈ R^{q×r}, i ∈ [W], denote the rows of V corresponding to the users of node i. The objective in (17) can then be decomposed across the W nodes, and (16) can be equivalently written as (23), where the estimated variables are U and the W components of L, E, C, and V.
The constraints in (23) can be decomposed across the W nodes, while the objective function cannot, due to the coupling of U and V. Here, we develop a synchronized Distributed Sparse Alternative Proximal Algorithm (DSAPA) to solve (23) with a convergence guarantee. Node i owns Y_i and estimates V_i, L_i, E_i, C_i, and U. Since all nodes have estimates of U, and L_i = UV_i^T, the key to protecting the user privacy of node i is to share neither the estimate of V_i nor Y_i with any other node.
In the (t + 1)th iteration, node i sequentially updates C_i^{t+1}, V_i^{t+1}, L_i^{t+1}, E_i^{t+1}, and U^{t+1} in Subroutines 1-5. Each subroutine essentially follows a projected gradient step. The gradients of H with respect to C_i, V_i, L_i, E_i, and U are given in (24)-(28), and the step sizes in the (t + 1)th iteration are selected as in (30)-(34). These step sizes are no greater than the reciprocals of the smallest Lipschitz constants of ∇_{C_i}H, ∇_{V_i}H, ∇_{L_i}H, ∇_{E_i}H, and ∇_U H in the tth iteration, respectively; details of the calculations are shown in Appendix 6. This property is useful for the convergence analysis of the DSAPA. The constraints in (22) are met by projecting the updated estimates onto S_{F_i}. For the constraints on C_i, in steps 10-15 of Subroutine 1, we first set the diagonal entries of C_i^{t+1} to zero; then, for each column j ∈ [q], we keep the d entries with the largest absolute values and set all other entries to zero. The infinity-norm constraint on L_i is met by setting all entries larger than α_1 to α_1 and all entries smaller than −α_1 to −α_1 (step 4 in Subroutine 3). A similar approach applies to E_i: we keep the s/W entries with the largest absolute values and set the other nonzero entries to zero (steps 3-6 in Subroutine 4).
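The three projection steps can be sketched in isolation as follows. The matrix sizes and sparsity levels are illustrative, and in DSAPA the diagonal of C_i is zeroed before the per-column truncation:

```python
import numpy as np

def project_topk_per_column(C, k):
    """Keep the k largest-magnitude entries of each column, zero the rest
    (the per-column sparsity projection applied to C_i)."""
    out = np.zeros_like(C)
    for j in range(C.shape[1]):
        idx = np.argsort(np.abs(C[:, j]))[-k:]
        out[idx, j] = C[idx, j]
    return out

def project_infnorm(L, alpha):
    """Clip entries into [-alpha, alpha] (the infinity-norm projection on L_i)."""
    return np.clip(L, -alpha, alpha)

def project_topk_global(E, k):
    """Keep the k largest-magnitude entries of the whole matrix
    (the s/W-sparsity projection on E_i)."""
    out = np.zeros_like(E)
    flat = np.argsort(np.abs(E), axis=None)[-k:]
    out.flat[flat] = E.flat[flat]
    return out

rng = np.random.default_rng(5)
C = rng.standard_normal((6, 4))
P = project_topk_per_column(C, 2)
print((np.count_nonzero(P, axis=0) == 2).all())          # d = 2 entries per column
print(np.abs(project_infnorm(C, 0.5)).max() <= 0.5)      # infinity norm enforced
print(np.count_nonzero(project_topk_global(C, 5)) == 5)  # s/W entries kept overall
```

Each projection is cheap and separable per node, which is what makes the projected gradient structure of the subroutines practical in the distributed setting.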
Note that L_i and E_i can be updated by node i independently and are not shared with other nodes. Updating C_i, V_i, and U requires communication with other nodes due to the coupling in the objective function. V_i cannot be shared, since otherwise other nodes could estimate L_i by multiplying U and V_i^T. Thus, node i computes intermediate terms that depend on V_i and sends those to the other nodes instead of sending V_i itself, as illustrated in Fig. 3.
The algorithm is initialized as follows. L_i^0 at node i is constructed from its acquired measurements Y_i. Node i then performs a truncated singular value decomposition of L_i^0 and sends the resulting left singular vectors to all other nodes, from which each node initializes its estimate of U; the remaining variables are initialized accordingly. The convergence of DSAPA is summarized as follows.
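A sketch of this local initialization, assuming (since the defining equation for L_i^0 is not reproduced in this excerpt) that unobserved entries of Y_i are simply zero-filled:

```python
import numpy as np

rng = np.random.default_rng(1)
m, q, r = 20, 6, 3

# Hypothetical local initialization: node i zero-fills the unobserved
# entries of its measurement block Y_i.
Y_i = rng.standard_normal((m, q))
mask = rng.random((m, q)) > 0.15          # observed-entry indicator
L0_i = np.where(mask, Y_i, 0.0)

# Truncated SVD of the local initialization; the leading left singular
# vectors seed the shared factor U.
Uf, s, Vt = np.linalg.svd(L0_i, full_matrices=False)
U0_i = Uf[:, :r]                          # m x r, orthonormal columns
V0_i = (np.diag(s[:r]) @ Vt[:r]).T        # q x r, so L0_i ~ U0_i @ V0_i.T
```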

Theorem 3 From any initial point, DSAPA always converges to a critical point of (23).
For data clustering, a central node collects Ĉ_i from all the nodes and applies spectral clustering [41] to obtain the clustering results.
When λ1 and λ2 are large enough, (23) approximates (3), but then the step sizes in (30)–(32) and (34) are small, which reduces the convergence rate. One practical remedy is to increase λ1 and λ2 dynamically [55]. We suggest the following schedule: initialize with small λ1 and λ2, and replace λ2 with ρλ2 (ρ > 1) in each of the first T0 iterations; then reset λ2 to its initial value and thereafter multiply both λ1 and λ2 by ρ in each iteration. The algorithm terminates after T iterations. In Subroutine 1, ∇_{C_i}H is computed according to (24) and the step size τ_{C_i}^t according to (30).
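The suggested schedule can be sketched as follows (the function and parameter names are ours; the paper's experiments use λ1 = λ2 = 0.5, ρ = 1.05, T0 = 40, T = 200):

```python
def penalty_schedule(lam1_0, lam2_0, rho, T0, T):
    """Per-iteration (lam1, lam2) values: lam2 grows alone for the first T0
    iterations, is then reset to its initial value, and afterwards both
    penalties grow by a factor rho each iteration."""
    lam1, lam2 = lam1_0, lam2_0
    history = []
    for t in range(T):
        history.append((lam1, lam2))
        if t < T0:
            lam2 *= rho           # warm-up: increase lam2 only
        else:
            if t == T0:
                lam2 = lam2_0     # reset lam2 to its initial value
            lam1 *= rho           # then grow both penalties together
            lam2 *= rho
    return history
```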
Subroutine 3 Iterate L_i at the ith distributed node. 1: Compute τ_{L_i} by (32). 2: Compute ∇_{L_i}H according to (26).

Results: numerical experiments
We evaluate the performance on the Irish smart meter dataset (ISMD) [56] and the UMass smart microgrid dataset (USMD) [57]. The ISMD covers more than 5000 residential customers; the measurements are recorded every 30 min in kilowatts (kW). The USMD contains 443 users over 24 h, with power consumption measured every minute. Some users have long sequences of zero power consumption, and some occasionally have significantly high power consumption. We suspect these measurements have data quality issues resulting from devices or communication and remove these users from the datasets. We use 4780 customers over 30 days for ISMD and 438 customers over 6 h for USMD. Thus, the size of the data matrix L is 1440×4780 for ISMD and 360×438 for USMD. The power consumption is at most 6 kW and 99 kW, respectively. Since the raw measurements are noisy, L is approximated by a rank-r matrix L*_{rank-r} obtained by keeping only the largest r singular values. The recovery error is measured by ‖L*_{rank-r} − L̂‖²_F / ‖L*_{rank-r}‖²_F, where L̂ is the recovered matrix. We choose r to be about 10% of the total number of singular values; r is thus set to 150 for ISMD and 40 for USMD. The following experiments are run on ISMD unless otherwise specified.
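The rank-r reference matrix and the relative recovery error can be computed as in this small NumPy sketch (toy dimensions here; the paper uses 1440×4780 with r = 150 for ISMD):

```python
import numpy as np

def rank_r_approx(L, r):
    """Best rank-r approximation of L via truncated SVD."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def relative_error(L_ref, L_hat):
    """Recovery error ||L_ref - L_hat||_F^2 / ||L_ref||_F^2."""
    return (np.linalg.norm(L_ref - L_hat, 'fro')**2
            / np.linalg.norm(L_ref, 'fro')**2)

# Tiny synthetic illustration of the error metric.
rng = np.random.default_rng(0)
L = rng.standard_normal((40, 60))
L5 = rank_r_approx(L, 5)
assert np.linalg.matrix_rank(L5) == 5
```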
As described in Section 2.3, normalized mutual information is used to measure the data privacy. We now calculate the average normalized mutual information N̂I over the 4780 users. As a comparison, we also calculate the normalized mutual information between the noisy data (before quantization) and the actual data. The quantization level K is chosen as 2 or 5. The quantization boundaries and quantized values are summarized in Table 2 (K = 2, 5); we place more boundaries in the region where the data concentrate. Selecting the optimal quantization boundaries is beyond the scope of this paper and is left for future work; we believe these parameters can be optimized if a small portion of ground-truth data is available for training. The noise level σ varies from 0.1 to 0.4 with a step size of 0.02. To compute the probability distribution of L_i, we divide the range 0–6 kW into 100 or 300 equal intervals and compute the empirical distributions. As shown in Fig. 4, the normalized mutual information between the quantized power and the actual power consumption is always smaller than that between the noisy value before quantization and the actual power consumption. This indicates that the proposed quantization process enhances the data privacy. In addition, the normalized mutual information N̂I decreases when either K decreases or σ increases, which is consistent with intuition. Since no ground-truth clustering result exists for this dataset, we define an index CI to evaluate the clustering performance. Let a_j denote the maximum angle of all the data points in group j to the estimated subspace of that group, and let b_j denote the minimum angle of any point in group j to the other subspaces. The clustering index CI measures the clustering accuracy: CI is large if the a_j's are small and the b_j's are large, which means that points in the same group are close to the subspace of that group and far from the other subspaces.
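As a sketch of how such a comparison can be computed, the following snippet estimates normalized mutual information from histograms. It assumes the normalization I(X; Y)/H(X) and uses invented quantization boundaries (the paper's Table 2 boundaries are not reproduced in this excerpt):

```python
import numpy as np

def normalized_mutual_information(x, y, bins_x, bins_y):
    """Empirical NI(x; y) = I(x; y) / H(x) from histogram estimates."""
    joint, _, _ = np.histogram2d(x, y, bins=[bins_x, bins_y])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    return mi / hx

# Hypothetical illustration: quantizing the noisy signal reduces the NMI
# with the actual signal (cf. Fig. 4). Boundaries below are invented.
rng = np.random.default_rng(0)
actual = rng.uniform(0, 6, 50_000)                 # "power" in [0, 6] kW
noisy = actual + rng.normal(0, 0.3, actual.size)
quantized = np.digitize(noisy, [1.0, 2.0, 3.0, 4.0]).astype(float)  # K = 5

bins = np.linspace(0, 6, 101)                      # 100 intervals over 0-6 kW
ni_noisy = normalized_mutual_information(actual, noisy, bins, 100)
ni_quant = normalized_mutual_information(actual, quantized, bins, 5)
assert ni_quant < ni_noisy                          # quantization lowers leakage
```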
A larger CI corresponds to a better clustering result. We apply Sparse Subspace Clustering (SSC) [21] to this dataset with different cluster numbers and compare the resulting CI's, using the Alternating Direction Method of Multipliers (ADMM) [58] to solve SSC. The maximum CI = 0.085 is obtained when the number of clusters is p = 4, so we set the number of clusters to 4 in the following experiments. We generate the corruptions E* and the noise N randomly. The nonzero entries of E* are drawn uniformly from [−4, −0.5] ∪ [0.5, 4], and every entry of N is drawn from N(0, 0.3²). The quantization level K is set to 5, and the locations of the missing data are selected randomly. The simulations run in MATLAB on a computer with a 3.4 GHz Intel Core i7.
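The corruption and noise generation described above can be sketched as follows (matrix sizes are scaled down for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 48, 100                     # scaled-down data matrix
s = int(0.05 * m * n)              # 5% corruption rate

# Sparse corruptions E*: s nonzero entries drawn uniformly
# from [-4, -0.5] U [0.5, 4].
E = np.zeros(m * n)
idx = rng.choice(m * n, size=s, replace=False)
mag = rng.uniform(0.5, 4.0, size=s)
E[idx] = mag * rng.choice([-1.0, 1.0], size=s)
E = E.reshape(m, n)

# Dense Gaussian noise N with entries from N(0, 0.3^2).
N = rng.normal(0.0, 0.3, size=(m, n))
```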
We evaluate DSAPA on the quantized measurements with W = 5 agents. We assume the upper bounds on the magnitudes of the sparse errors and of the power consumption are known; for simplicity, we use the largest value of the generated errors and set α2 = 4, and similarly set α1 = 6. We set d = 50. λ1 and λ2 are initialized to 0.5, and ρ = 1.05. The maximum iteration number T is set to 200, and T0 is set to 40.
Here, d is selected to be approximately r/(p − 1); we use p − 1 to account for the overlap between subspaces. We remark that varying d around the selected value does not affect the result. λ1 and λ2 are self-adjusted in our algorithm, as discussed in the last paragraph of Section 4. Figure 5 shows the energy consumption of a single user over 24 h. It compares the actual data, the rank-150 approximation of the actual data, the quantized observations, the data recovered by DSAPA, and the average quantized data of the users in the same group. One can see that the rank-150 approximation has a pattern similar to the actual data. Clearly, the details of power consumption are hidden in the quantized measurements; for instance, the two consumption peaks are no longer visible. Thus, an intruder who accesses only the quantized measurements of a user cannot learn that user's pattern. On the other hand, DSAPA accurately recovers the power consumption trend from the quantized data: the two peak loads are clearly identified in the recovered data, as shown in Fig. 5. The recovered data can be used for grid planning.
After obtaining Ĉ using DSAPA, we apply spectral clustering [41] to cluster the data points. To visualize the recovered consumption pattern of users in each group, we normalize the power consumptions and compute the average over users in the same group. Figure 6 shows the average profile obtained by our method over 1 day (with no missing data and with 15% missing data). For comparison, the mean daily profile of the ground-truth data clustered by SSC is also shown in Fig. 6. One can see that the data losses do not affect the recovery performance of DSAPA, and the recovered patterns are close to the actual patterns obtained by SSC, considering that the measurements are highly noisy and quantized. We next pick users in the same group and average their quantized values (K = 5), and we calculate the normalized mutual information between one user and this average. Figure 7 shows the normalized mutual information as the number of selected users varies; the value does not decrease much as the number of users increases. Compared with Fig. 4b, one can see that the averaged quantized value of the group does not reveal much information about a single user. We compare DSAPA with the Approximate Projected Gradient Method (APGM) [28] and Quantized Robust Principal Component Analysis (QRPCA) [35] for data recovery in Fig. 8a. We apply SSC to the data recovered by APGM (or QRPCA) to obtain the clustering result, labeled "APGM + SSC" ("QRPCA + SSC") in Fig. 8b. If we simply use the quantized values Q1, Q2, ..., Q5 to estimate the actual power consumption, the relative recovery error is 0.869, which is much larger than the errors in Fig. 8a. When the missing data rate varies from 0 to 0.4, our method always outperforms the other methods in both data recovery and data clustering. For comparison, CI = 0.085 for SSC on the ground-truth data, and CI = 0.05 for a random clustering.
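The "Ĉ to clusters" step can be sketched compactly: build the symmetric affinity |Ĉ| + |Ĉ|^T, embed via the bottom eigenvectors of the normalized Laplacian, and run a small k-means, following the standard spectral clustering recipe of [41]. The affinity construction and the k-means details here are our simplifications:

```python
import numpy as np

def spectral_clustering(C, p, n_iter=50):
    """Cluster columns given a self-representation matrix C-hat."""
    A = np.abs(C) + np.abs(C).T                  # symmetric affinity
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = np.eye(len(A)) - d_is[:, None] * A * d_is[None, :]
    w, V = np.linalg.eigh(L_sym)                 # ascending eigenvalues
    X = V[:, :p]                                 # bottom-p eigenvectors
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)

    # Deterministic farthest-point initialization, then Lloyd iterations.
    centers = [X[0]]
    for _ in range(1, p):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(p):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Toy check: a block-diagonal C-hat with two planted groups splits cleanly.
C_hat = np.zeros((10, 10))
C_hat[:5, :5] = 0.5
C_hat[5:, 5:] = 0.5
labels = spectral_clustering(C_hat, 2)
assert labels[0] != labels[5]
```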
Our method achieves CI = 0.08 using quantized measurements with 5% corruptions and no data losses. We then vary the number of users by randomly selecting a subset of the 4780 users. Under a 15% missing rate and no corruption, Fig. 9 shows the recovery error as the number of users varies: the recovery error is 0.35 when the number of users is 500 and decreases to 0.2 when there are 2500 users.
We also test the case when no additional noise is added before quantization. When implementing DSAPA, we vary the assumed noise level, since the measurements usually contain observation noise. As shown in Fig. 10, DSAPA can recover the data even when no additional noise is added. However, adding no noise leads to a low privacy level: the normalized mutual information for K = 2 and K = 5 is 0.2862 and 0.9579, respectively (0.02 kW per interval). These values are much higher than those in Fig. 4, indicating a lower level of privacy when no noise is added.
In Fig. 11, we compare the relative recovery error and the clustering index CI of DSAPA with those of the centralized algorithm Sparse-APA in [27]. Since Sparse-APA does not consider missing data, we study the case with full observations. The corruption rate is set to s/mn = 5%. The recovery error of Sparse-APA is smaller than that of our method at initialization, because Sparse-APA can compute a better initial point in a centralized fashion. However, the difference shrinks as the iteration number increases, and after 200 iterations both algorithms perform similarly. We next show the performance of DSAPA on USMD. Since the measurements vary from 0 to 100 kW, we set K = 7 and α1 = 50. The quantization boundaries and quantized values are in Table 2 (K = 7). p and d are set to 4 and 15, respectively, using the same technique as in the previous experiments. We generate the corruptions E* and the noise N randomly: the nonzero entries of E* are drawn uniformly from [−10, 10], the corruption rate is 5%, and every entry of N is drawn from N(0, 0.3²). Similar to Figs. 5 and 8a, we show the results on USMD in Fig. 12. Fig. 11 Comparison between DSAPA and the centralized Sparse-APA [27]. Fig. 12 a Comparison of the actual data, rank-50 approximation, recovered data, and quantized data (USMD). b Relative recovery error as the missing rate changes.

Conclusion and discussions
This paper for the first time shows that the two seemingly contradictory objectives of data privacy and information accuracy of smart meter data can be achieved simultaneously.
The central technical contribution is the development of a decentralized data recovery and clustering method from highly quantized, partially lost, and partially corrupted measurements. Distributed nodes do not share raw data with each other and cannot estimate the actual data of other nodes. We propose a Distributed Sparse Alternative Proximal Algorithm (DSAPA) with a convergence guarantee to solve the nonconvex problem. The recovery error of our method is nearly optimal. The method is evaluated on actual smart meter datasets. Future work includes leveraging the temporal correlation within each user's data to further improve the method and developing asynchronous decentralized data recovery algorithms.

Appendix 1
Supporting lemmas used in the proof of Theorem 1. Lemma 1 Under Assumptions 1 and 2, the following inequalities hold, where (a) follows from Lemmas 8 and 9 in [28] and from the assumption n_i ≤ ξn/p.
Let X̂, X* ∈ S_{f_X}, and suppose the same assumptions as in Theorem 1 hold. Then, with probability at least 1 − pC1e^{−C2ξn/p}, the stated inequality holds for positive constants C1 and C2. Here ⟨·, ·⟩ denotes the inner product of two matrices, i.e., the sum of entry-wise products.
Proof The proof generalizes the proof of Lemma 2 in [27], which does not consider missing data; here we extend the analysis to handle missing data. By definition, there exists a permutation matrix Π* such that L* can be written as L* = [L*_1, L*_2, ..., L*_p]Π*. By Assumption 2, L̂ can be written as L̂ = [L̂_1, L̂_2, ..., L̂_p]Π*, where the dimension of each L̂_i is smaller than or equal to (g − 1)d.
Combining these with (7), one can conclude that the elements of L_α^{−1}∇_X F(X*)_i have zero mean, and their variances are bounded by one. Using the result of Lemma 1 in [27], the bound holds with probability at least 1 − C1e^{−C2ξn/p} for each i, where X*_i denotes the ith group, corresponding to L*_i, under the permutation Π*. Taking a union bound over i, (45) holds with probability at least 1 − pC1e^{−C2ξn/p}. Step (a) follows from the linearity of the inner product. The first term of (b) follows from |⟨A, B⟩| ≤ ‖A‖2 ‖B‖*. The second term of (b) follows from the fact that both Ê and E* have at most s nonzero entries and |∇_X F(X*)_{i,j}| ≤ 1. Step (c) follows from (45) and the fact that |Ê_{i,j} − E*_{i,j}| is bounded by 2α2. The probability 1 − pC1e^{−C2ξn/p} comes from the union bound over P(max_{i∈[p]} ·).

Appendix 2
Proof of Theorem 1 Proof The proof follows and extends the proofs of Theorem 1 in [28] and Theorem 5 in [37]. We extend from the low-rank matrices in [28, 37] to matrices whose columns lie in a union of p low-dimensional subspaces. Moreover, ref. [28] does not consider corruptions, and ref. [37] does not consider missing data; here we consider both missing data and corruptions.
The first bound in (8), 2α1 + 2α2√(s/(mn)), follows from the fact that L̂, L*, Ê, E* ∈ S_f. We discuss the second bound in (8) as follows. Denote by F(X) the objective (4) when X is treated as the variable. Since S_{f_X} is a compact set and the objective function is continuous in X, F(X) achieves a minimum on S_{f_X}. Suppose X̂ ∈ S_{f_X} minimizes F(X).

Appendix 6
DSAPA: proof of the Lipschitz differential property and calculation of Lipschitz constants A function is Lipschitz differentiable if and only if all its partial gradients are Lipschitz continuous. The definition is shown in Definition 3.
We provide the Lipschitz differential property of H and compute the corresponding Lipschitz constants of its partial gradients with respect to C_i, V_i, L_i, E_i, and U, for all i ∈ [W]. Let L^{t+1}_{p1}, L^{t+1}_{p2}, L^{t+1}_{p3}, L^{t+1}_{p4}, and L^{t+1}_{p5} denote the smallest Lipschitz constants of ∇_{C_i}H, ∇_{V_i}H, ∇_{L_i}H, ∇_{E_i}H, and ∇_U H in the (t + 1)th iteration. In the chain of inequalities, (a) follows from (30). Equation (69) then implies the subsequent bound, where (b) follows from the triangle inequality and (c) follows from (31). Step (d) comes from the differential mean value theorem; ∇²F(L_i) ∈ R^{m×q} has (k, j)th entry equal to ∂²F/∂(L_i)²_{k,j}, and diag(∇²F(L_i)) ∈ R^{mq×mq} is a diagonal matrix whose diagonal equals vec(∇²F(L_i)). Step (e) follows from the fact that the ℓ2 norm of a diagonal matrix equals its entry-wise infinity norm. Note that (1) is lower bounded by β, and the probability density function of the normal distribution and its derivative are upper bounded by 1/(√(2π)σ) and e^{−1/2}/(√(2π)σ²), respectively. One can then check that ‖∇²F(L_i)‖_∞ is bounded by 1/(σ²β²). Step (f) is obtained by upper bounding ‖∇²F(L_i)‖_∞, and (g) follows from (32).
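The two density bounds quoted above can be verified directly; a short derivation (standard Gaussian calculus, in our notation, with φ denoting the N(0, σ²) density):

```latex
\phi(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}
  \;\le\; \frac{1}{\sqrt{2\pi}\,\sigma},
\qquad
\phi'(x) = -\frac{x}{\sigma^2}\,\phi(x).
```

Since |x| e^{−x²/(2σ²)} attains its maximum σe^{−1/2} at |x| = σ,

```latex
|\phi'(x)| \;=\; \frac{|x|}{\sigma^2}\,\phi(x)
  \;\le\; \frac{\sigma e^{-1/2}}{\sigma^2}\cdot\frac{1}{\sqrt{2\pi}\,\sigma}
  \;=\; \frac{e^{-1/2}}{\sqrt{2\pi}\,\sigma^2},
```

which matches the bounds used to control ‖∇²F(L_i)‖_∞.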