Achieve data privacy and clustering accuracy simultaneously through quantized data recovery
EURASIP Journal on Advances in Signal Processing volume 2020, Article number: 22 (2020)
Abstract
This paper develops a data collection and processing framework that achieves individual users’ data privacy and the operator’s information accuracy simultaneously. Data privacy is enhanced by adding noise and applying quantization to the data before transmission, and the privacy of an individual user is measured by information-theoretic analysis. This paper develops a data recovery and clustering method for the operator to extract features from the privacy-preserving, partially corrupted, and partially observed measurements of a large number of users. To prevent cyber intruders from accessing the data of many users, it also develops a decentralized algorithm such that multiple data owners can collaboratively recover and cluster the data without sharing the raw measurements directly. The recovery accuracy is characterized analytically and shown to be close to the fundamental limit of any recovery method. The proposed algorithm is proved to converge to a critical point from any initial point. The method is evaluated on recorded Irish smart meter data and UMass smart microgrid data.
Introduction
Smart meters provide fine-grained measurements of power consumption of industrial and residential customers and can enhance the distribution system visibility. Non-intrusive load monitoring (NILM) approaches [1, 2] can identify individual appliances from the high-time-resolution smart meter data of the aggregated power consumption. Intruders can thus extract user behavior, and user privacy is an increasing concern. One way to protect data privacy is by applying additive homomorphic encryption [3]. It requires the network to have tree-like connections and can only decrypt the sum of the load curves. The other way to enhance data privacy is data obfuscation, whereby the actual power consumption of each household is masked by adding noise to the smart meter measurements either through signal processing approaches [4, 5] or by physically adding rechargeable batteries to the households [6, 7]. Moreover, the aggregated consumption of the load and the battery can be adjusted to a constant to obfuscate the information further [8, 9]. Then, applying NILM to these noisy and quantized measurements, an intruder can no longer accurately identify the patterns of individual appliances and, in turn, the user behavior. The increase in user privacy is achieved, however, at a cost of data distortion and reduced data accuracy for the operating center [10–12]. Although the operating center does not need high-time-resolution information of every individual appliance in each household, it still requires accurate estimation of the aggregated power consumption and the common load patterns among households for forecasting, demand response, and planning. For example, the center clusters customers with similar load patterns and then employs the load pattern of each cluster to enhance the load forecasting accuracy [13] and determine the incentives for demand response [14, 15].
If noise and quantization are added to the data to enhance the privacy, the information accuracy for the operator is effectively reduced.
This paper shows that the data privacy can be protected for each individual user^{Footnote 1} and, at the same time, the information accuracy at the operating center about user power consumption and the major patterns among different users is maintained. To the best of our knowledge, this is the first work that achieves data privacy and information accuracy simultaneously. In our proposed framework, each user’s actual power consumption is masked by first adding noise to the measurements and then quantizing the output to one of a few levels. The privacy of an individual user can be enhanced in this way, from an information-theoretic perspective [16–19]. Once the data is quantized, the variation information is blurred and hence NILM methods fail to identify individual appliances. Although adding noise and quantization have been employed before to enhance privacy (e.g., [6, 20]), this paper, for the first time, shows that such privacy enhancement does not necessarily lead to a reduction in the information accuracy. The central technical contribution of this paper is the development of a data recovery and clustering method that works even when the measurements are highly noisy and quantized, contain significant errors, and are partially lost. Our method is proved to provide accurate data recovery and clustering results, as long as the center has measurements from a sufficient number of users. In contrast, a cyber intruder with access to the measurements of a small number of users cannot obtain accurate information even with the same approach. We develop a decentralized algorithm that allows multiple data owners to cooperatively recover and cluster the data without sharing their own raw measurements directly. Then, it is extremely difficult for an intruder to access large amounts of data. Thus, the data privacy of an individual user is enhanced while maintaining the information accuracy for the operating center.
Since the load profiles with similar load patterns can be represented by data points in a low-dimensional subspace in the high-dimensional ambient space, all the load profiles can be characterized by the Union of Subspaces (UoS) model [21], and the load clustering problem can be formulated as a subspace clustering problem. Various subspace clustering techniques have been developed, see e.g., [21–26]. None of these approaches, however, considers the case that the measurements are highly quantized. To the best of our knowledge, only one recent work considered subspace clustering and data recovery from highly noisy and quantized data [27]. This paper follows the mathematical setup of [27] but extends it significantly in the following aspects. Ref. [27] does not consider data privacy, while this paper proposes a data collection framework to achieve data privacy and information accuracy simultaneously. We characterize the data privacy through mutual information, and such analysis does not exist in [27]. Ref. [27] assumes that all the measurements are available to the center, while this paper considers a more general setup in which some measurements are lost during the transmission and do not arrive at the center. This paper characterizes the data recovery error of our proposed method analytically as a function of the data loss percentage. Moreover, this paper characterizes the fundamental limit of the recovery error by any possible recovery method and shows that our method is nearly optimal in reducing the recovery error. None of these fundamental analyses exist in [27]. Furthermore, only a centralized algorithm, SparseAPA, is discussed in [27]. This paper develops a Distributed Sparse Alternative Proximal Algorithm (DSAPA) for multiple data owners to collaboratively solve the subspace clustering and data recovery problem without sharing the measurements with others. Thus, the user data privacy can be further protected.
This paper is also related to the quantized matrix recovery problem [28–36], in which the data matrix is assumed to be low rank. The low-rank matrix model is a special case of the UoS model by restricting to one subspace only. In fact, the data matrix of the load profiles can be high rank or even full rank in our setup. Finally, we remark that this paper considers smart meter measurements that measure the aggregated energy consumption in a house, and does not consider applying NILM on the operator side. Distributed smart metering can provide energy consumption of individual electrical appliances in a house [20].
The rest of the paper is organized as follows. Section 2 introduces our proposed framework, problem formulation, related works, and the data privacy enhancement analysis. The theoretical analysis of our recovery and clustering method is presented in Section 3. Section 4 introduces the details of the DSAPA with its convergence guarantee. Section 5 records the numerical experiments of our method on the real smart meter dataset. Section 6 concludes the paper. All the proofs are deferred to Appendices 1 through 7.
Our proposed framework of privacy-preserving data collection and information recovery
Our framework and problem formulation
Figure 1 visualizes our proposed framework of privacy-preserving smart meter data collection and information recovery. To enhance the user data privacy, the actual power consumption is mapped to a few fixed power levels at the output of the smart meter. One can achieve this through signal processing in the smart meter or by connecting a rechargeable battery to each household. Thus, the actual consumption is masked in the noisy and quantized smart meter measurements. As shown in Fig. 1, the measurements are collected by W agents disjointly, and agents do not share measurements directly. The agents recover the data and cluster the users with similar consumption patterns collaboratively in a distributed fashion. When W=1, the framework reduces to the case of one single center.
We defer the discussion of user privacy enhancement through the proposed framework to Section 2.3. We first define the recovery and clustering problem from quantized data mathematically as follows. \(L^{*} \in \mathbb {R}^{m \times n}\) denotes the actual power usages of n users, with each column containing the power usage of one user at m time instants. We assume that users with similar consumption patterns belong to the same group and there are p groups in total. The corresponding columns of the same group belong to a d-dimensional subspace in \(\mathbb {R}^{m}\) with d≤m. Let S_{i} (i∈[p]) denote the ith subspace, and these p subspaces are distinct^{Footnote 2}. Let r denote the rank of L^{∗}, then r≤pd. Let \(L^{*}_{i}\) denote the submatrix of L^{∗} that contains points in S_{i}, and let n_{i} denote the number of columns in \(L^{*}_{i}\), i.e., the number of users in group i. We assume m≤n_{i}≤ξn/p for all i and some positive constant ξ. We further assume m=n/(κp) for some positive constant κ to simplify the presentation of the main results.
There exists a coefficient matrix \(C^{*}\in \mathbb {R}^{n \times n}\) such that L^{∗}=L^{∗}C^{∗} and \(C^{*}_{i,i}=0\) for all i∈[n]. Moreover, \(C^{*}_{i,j}\) is zero if the ith and jth columns of L^{∗} do not belong to the same subspace [21]. These two properties, the self-expressive property and the subspace-preserving property, have been exploited in the literature of subspace clustering and are summarized in Definition 1.
Definition 1
[27] A matrix \(L \in \mathbb {R}^{m \times n}\) has the self-expressive property if L=LC for some \(C \in \mathbb {R}^{n \times n}\), and C_{i,i}=0 for all i∈[n]. Moreover, C has the subspace-preserving property of L if C_{i,j}=0 for columns i and j of L belonging to different subspaces.
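The two properties in Definition 1 can be checked numerically on a toy example. The following is a minimal sketch, assuming two random subspaces and known group labels; the sizes, the variable names, and the least-squares construction of C are illustrative assumptions and are not the estimation method of this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n_per = 6, 2, 5  # ambient dimension, subspace dimension, columns per subspace

# Two random d-dimensional subspaces of R^m, with n_per columns drawn from each.
bases = [rng.standard_normal((m, d)) for _ in range(2)]
L = np.hstack([B @ rng.standard_normal((d, n_per)) for B in bases])
labels = np.repeat([0, 1], n_per)  # ground-truth subspace of each column (assumed known here)
n = L.shape[1]

# Write column j as a combination of the other columns of its own subspace.
C = np.zeros((n, n))
for j in range(n):
    peers = [i for i in range(n) if labels[i] == labels[j] and i != j]
    coef, *_ = np.linalg.lstsq(L[:, peers], L[:, j], rcond=None)
    C[peers, j] = coef

different = labels[:, None] != labels[None, :]
# Self-expressive: L = LC with a zero diagonal.
self_expressive = np.allclose(L, L @ C) and np.allclose(np.diag(C), 0.0)
# Subspace-preserving: C is zero across different subspaces.
subspace_preserving = bool(np.all(C[different] == 0.0))
print(self_expressive, subspace_preserving)
```

In practice the labels are unknown, which is why C has to be estimated jointly with the data rather than built from the groups as done here.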
Let matrix \(E^{*} \in \mathbb {R}^{m \times n}\) denote the additive errors in the measurements. We assume the number of nonzeros s in E^{∗} is much smaller than mn. The partially corrupted measurements can be represented by X^{∗}=L^{∗}+E^{∗}. We assume the energy consumption and the errors are bounded, i.e., ∥L^{∗}∥_{∞}≤α_{1} and ∥E^{∗}∥_{∞}≤α_{2}, for some positive constants α_{1},α_{2}, and the infinity norm ∥·∥_{∞} measures the maximum absolute value.
The quantization process in each household is modeled as follows. The measured energy consumption at each time step is mapped to one of K values in a probabilistic fashion. Figure 2 shows the quantization process. It can be modeled as adding random noise first and then quantizing to K levels. The noise matrix \(N \in \mathbb {R}^{m \times n}\) is independent of X^{∗}. Entries of N are i.i.d. generated from a fixed cumulative distribution function (c.d.f.) Ψ(x). The quantization boundaries ω_{0}<ω_{1}<...<ω_{l−1}<ω_{l}<...<ω_{K} and the quantized value \(\mathcal {Q}_{l}, l \in [K]\) for the bin [ω_{l−1},ω_{l}) are given. Then, the probability of mapping \(X^{*}_{i,j}\) to \(Y_{i,j} = \mathcal {Q}_{l}\), for all i, j, is represented by
\[{\varphi }_{l}\left (X^{*}_{i,j}\right) = \Psi \left (\omega _{l}-X^{*}_{i,j}\right) - \Psi \left (\omega _{l-1}-X^{*}_{i,j}\right), \tag{1}\]
and \(\sum _{l=1}^{K}{\varphi }_{l}\left (X^{*}_{i,j}\right)=1\). The noise N is introduced to hide the user information. One choice of Ψ(x) is the probit model with Ψ(x)=Ψ_{norm}(x/σ), where Ψ_{norm} is the c.d.f. of the standard Gaussian distribution \(\mathcal {N}(0,1)\), and σ>0 is the standard deviation. Note that \(\Psi \left (\omega _{l}-X^{*}_{i,j}\right) \geq \Psi \left (\omega _{l-1}-X^{*}_{i,j}\right)+\beta \) for some positive β. Then, 1≥φ_{l}≥β>0.
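The noise-then-quantize model can be simulated in a few lines. The sketch below assumes the probit model, with ω_{0}=−∞ and ω_{K}=+∞ so that the bin probabilities sum to one; the function names `probit_cdf`, `bin_probabilities`, and `quantize`, and all numerical values, are hypothetical choices made for illustration.

```python
import math
import numpy as np

def probit_cdf(x, sigma):
    """Psi(x) = Psi_norm(x / sigma), the c.d.f. of N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def bin_probabilities(x, boundaries, sigma):
    """phi_l(x) = Psi(omega_l - x) - Psi(omega_{l-1} - x) for l = 1..K."""
    cdf = [probit_cdf(w - x, sigma) for w in boundaries]
    return [cdf[l] - cdf[l - 1] for l in range(1, len(boundaries))]

def quantize(X, boundaries, levels, sigma, rng):
    """Add i.i.d. Gaussian noise, then map each entry to the value of its bin."""
    noisy = X + rng.normal(0.0, sigma, size=X.shape)
    inner = np.asarray(boundaries)[1:-1]             # omega_1, ..., omega_{K-1}
    idx = np.searchsorted(inner, noisy, side="right")  # 0-based bin index
    return np.asarray(levels)[idx]

rng = np.random.default_rng(0)
boundaries = [-math.inf, 1.0, 2.0, math.inf]  # K = 3 bins (illustrative values)
levels = [0.5, 1.5, 2.5]                      # quantized value of each bin
phi = bin_probabilities(1.2, boundaries, sigma=0.3)
Y = quantize(np.full((200, 100), 1.2), boundaries, levels, 0.3, rng)
print(phi, np.mean(Y == 1.5))  # empirical frequency of the middle level vs. phi[1]
```

The empirical frequency of each level approaches the corresponding φ_{l} as the number of samples grows, which is the property the likelihood-based recovery relies on.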
The quantized measurements Y are sent to the center. Data losses can happen during the communication, visualized by the question marks in Fig. 2. Let set Ω denote the indices of measurements that are not lost. In the general case that the measurements are collected by W agents/nodes separately, we assume for simplicity that each node collects the data from q=n/W users. Node 1 collects the data from the first q users; node 2 collects the next q users and so on. Let Φ_{i}={q(i−1)+1,q(i−1)+2,...,qi}, then \(L^{*}_{\Phi _{i}}\) denotes the submatrix of L^{∗} with column indices in Φ_{i}. L^{∗} can also be written as \(\left [L^{*}_{\Phi _{1}},L^{*}_{\Phi _{2}},...,L^{*}_{\Phi _{W}}\right ]\). Similarly, \(E^{*}=\left [E^{*}_{\Phi _{1}},E^{*}_{\Phi _{2}},...,E^{*}_{\Phi _{W}}\right ]\). Node i collects \(Y_{\Phi _{i}}\).
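The column partition among the W nodes can be written down directly. This is a small sketch with 0-based indices (so block i in the code corresponds to Φ_{i+1} in the text); the function name `node_column_indices` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def node_column_indices(n, W):
    """Return the W disjoint column-index blocks, each of size q = n / W."""
    if n % W != 0:
        raise ValueError("this sketch assumes n is divisible by W")
    q = n // W
    return [np.arange(q * i, q * (i + 1)) for i in range(W)]

Y = np.arange(24).reshape(2, 12)              # toy measurements: m = 2, n = 12
blocks = node_column_indices(Y.shape[1], W=3)
Y_parts = [Y[:, idx] for idx in blocks]       # what each node holds locally
```

Concatenating `Y_parts` column-wise gives back Y, mirroring the block form of L^{∗} and E^{∗} above.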
The data recovery and pattern extraction problem for one center can be stated as follows.
(P1) Given quantized measurements Y_{Ω}, known boundaries ω_{0}<ω_{1}<...<ω_{K}, and noise distribution Ψ, can we recover the real power usages L^{∗} and cluster the users through estimating C^{∗} simultaneously?
Moreover, if measurements Y_{i,j}’s are not shared among W nodes to protect the user privacy,
(P2) Can we estimate L^{∗} and C^{∗} with W nodes in a decentralized fashion?
Some notations in this paper are summarized in Table 1.
Related work
When p=1, i.e., all the users share the same pattern, L^{∗} is approximately a low-rank matrix. Then, (P1) reduces to the problem of low-rank matrix recovery from quantized measurements [28–37], with motivating applications in image processing [38], collaborative filtering [31], and sensor networks [39]. Note that since there is only one subspace in this case, these works do not consider data clustering and only focus on data recovery.
When the quantization process does not exist, the problem (P1) reduces to the conventional subspace clustering problem [21–26,40]. If the subspace-preserving C^{∗} is estimated, one can apply the spectral clustering [41] method to obtain the clustering of the data points. For example, Sparse Subspace Clustering (SSC) [21] is a common choice for subspace clustering, and SSC estimates C^{∗} by solving a convex optimization problem. Other clustering methods exist that cluster data points based on the Euclidean distance. For instance, Lin et al. [42] and Keogh et al. [43] leverage a linear combination of box basis functions to approximate the original data while still retaining the features of interest.
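To illustrate how group labels can be read off a coefficient matrix, the sketch below performs a two-way spectral split. It is a simplification under stated assumptions: it uses the unnormalized graph Laplacian and the sign of the second eigenvector, handles only two groups, and is not claimed to be the exact procedure of [41]; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def two_way_spectral_clustering(C):
    """Split the columns into two groups from a coefficient matrix C."""
    A = np.abs(C) + np.abs(C).T            # symmetric affinity between columns
    laplacian = np.diag(A.sum(axis=1)) - A  # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(laplacian)     # eigenvalues in ascending order
    fiedler = vecs[:, 1]                    # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)

# Toy estimate: strong within-group coefficients, weak leakage across groups.
C_hat = np.full((6, 6), 0.01)
C_hat[:3, :3] = 0.5
C_hat[3:, 3:] = 0.5
np.fill_diagonal(C_hat, 0.0)
print(two_way_spectral_clustering(C_hat))  # columns 0-2 and 3-5 receive different labels
```

Which group is labeled 0 or 1 depends on the sign of the eigenvector and carries no meaning; for more than two groups one would typically cluster the rows of several eigenvectors instead.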
Reference [27] is the first paper that studies subspace clustering from quantized measurements when p≥1. Wang et al. [27] do not consider missing data and develop a centralized data recovery method from full observations. This paper follows the same problem formulation as [27] and extends it to the general case of partial observations. We provide both the recovery guarantee of our approach and the fundamental limit of the recovery accuracy by any method. Moreover, a framework of privacy-preserving smart meter data collection is proposed in this paper, and we further enhance the data privacy by developing a decentralized data recovery method.
Our problem formulation and methods apply to other domains such as image and video processing and phasor measurement unit (PMU) data analytics for power systems. In image recovery and image clustering [27], images of the same person with varying illumination belong to the same low-dimensional subspace [44]. Columns of L^{∗} correspond to images of multiple people. The goal is to enhance the image quality and cluster the data using low-resolution images. Similarly, in motion segmentation, each column of L^{∗} represents the trajectory of a reference point. The reference points in the same rigid object belong to the same subspace. The motion segmentation becomes a subspace clustering problem from the observed measurements. In PMU data analytics, the time series of PMUs affected by the same event belong to the same subspace [32,45]. The event location problem can be solved by subspace clustering.
Data privacy enhancement in the proposed framework
Various methods have been developed to enhance the privacy of power consumption data. For example, one can use preprocessing techniques like temporal averaging, adding additional noise, and quantization [4,5,20] to alter the data. However, directly altering data might affect the accuracy of some applications, e.g., billing and profiling [46]. Alternatively, rechargeable batteries and PV converter can be leveraged to mask the actual power consumption [6,7,47]. The noise addition and quantization process in this paper can be achieved by either signal processing or rechargeable batteries.
In general, privacy guarantees can be achieved through either computational hardness [48–50] or information-theoretic analysis [16–19]. The existing analytical results of data privacy only work for specific or simple models and do not easily generalize. For instance, under the setup of communication between two nodes, ref. [17] analyzes the tradeoff between data sharing and privacy. Under the assumptions of an i.i.d. input load sequence and an i.i.d. energy harvesting process, the minimum information leakage rate is provided with a certain energy management policy in [51]. Some other methods try to analyze data privacy numerically. In [52], the information leakage rate is measured by the relative entropy of the probability measures of the original load data and the modified load data and is calculated by the Monte Carlo method. Refs. [7] and [12] consider measuring the information leakage through mutual information of the original load data and the modified load data. Following the existing work on smart meter data privacy, see, e.g., [19,52–54], this paper analyzes the data privacy from an information-theoretic perspective. The data privacy of an individual user is analyzed by comparing the original data and the data after privacy enhancement through quantities like the Kullback-Leibler (KL) divergence [52], mutual information, and normalized mutual information [18]. In our framework, the actual energy consumption of user i, denoted by \(L^{*}_{\star i}\), is masked by additive Gaussian noise and quantization, resulting in Y_{⋆i}. Let \(P_{L^{*}_{\star i}}\) and \(P_{Y_{\star i}}\) denote the probability distribution of \(L^{*}_{\star i}\) and Y_{⋆i}, respectively. The privacy can be measured through the normalized mutual information (NI) between \(L^{*}_{\star i}\) and Y_{⋆i} [18], defined as follows:
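Definition 2
[18] The normalized mutual information between \(L^{*}_{\star i}\) and Y_{⋆i} is
\[NI\left (L^{*}_{\star i}, Y_{\star i}\right) = \frac {\sum _{x \in \mathcal {X}}\sum _{y \in \mathcal {Y}} P_{\left (L^{*}_{\star i},Y_{\star i}\right)}(x,y) \log \frac {P_{\left (L^{*}_{\star i},Y_{\star i}\right)}(x,y)}{P_{L^{*}_{\star i}}(x)P_{Y_{\star i}}(y)}}{-\sum _{x \in \mathcal {X}} P_{L^{*}_{\star i}}(x) \log P_{L^{*}_{\star i}}(x)}, \tag{2}\]
where spaces \(\mathcal {X}\) and \(\mathcal {Y}\) are the feasible sets of \(L^{*}_{\star i}\) and Y_{⋆i}, respectively. \(P_{\left (L^{*}_{\star i},Y_{\star i}\right)}\) is the joint distribution of \(L^{*}_{\star i}\) and Y_{⋆i}. \(P_{L^{*}_{\star i}}\) and \(P_{Y_{\star i}}\) are the marginal distributions of \(L^{*}_{\star i}\) and Y_{⋆i}, respectively.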
The numerator of (2) is the mutual information between \(L^{*}_{\star i}\) and Y_{⋆i}, and the denominator is the entropy of \(L^{*}_{\star i}\). When \(L^{*}_{\star i}\) and Y_{⋆i} are independent of each other, \(NI\left (L^{*}_{\star i}, Y_{\star i}\right)\) reaches its minimum value 0. When Y_{⋆i} is exactly the same as \(L^{*}_{\star i}\), \(NI\left (L^{*}_{\star i}, Y_{\star i}\right)\) equals the maximum value 1. A smaller NI corresponds to a higher level of data privacy of \(L^{*}_{\star i}\) and also indicates a more significant difference between \(L^{*}_{\star i}\) and Y_{⋆i}. Note that, rigorously speaking, \(L^{*}_{\star i}\) belongs to a continuous space. However, since all measuring devices have a finite resolution, \(L^{*}_{\star i}\) can be viewed as a discrete random variable. When computing NI in practice, one can divide the range of the values into small regions to compute the sample probability distribution.
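The binning procedure described above can be sketched as a histogram-based estimate of NI (mutual information divided by the entropy of the original signal). The function name, the bin counts, and the synthetic uniform load are assumptions made for illustration; a plug-in estimate of this kind is slightly biased for finite samples.

```python
import numpy as np

def normalized_mutual_information(x, y, bins_x, bins_y):
    """Plug-in estimate of I(x; y) / H(x) from paired samples."""
    joint, _, _ = np.histogram2d(x, y, bins=[bins_x, bins_y])
    p_xy = joint / joint.sum()               # sample joint distribution
    p_x = p_xy.sum(axis=1)                   # marginal of x
    p_y = p_xy.sum(axis=0)                   # marginal of y
    nz = p_xy > 0
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / np.outer(p_x, p_y)[nz]))
    h_x = -np.sum(p_x[p_x > 0] * np.log(p_x[p_x > 0]))
    return mi / h_x

rng = np.random.default_rng(0)
load = rng.uniform(0.0, 3.0, size=50000)                 # synthetic "actual" consumption
noisy = load + rng.normal(0.0, 0.5, size=load.size)      # additive Gaussian noise
masked = np.digitize(noisy, [1.0, 2.0])                  # quantize to K = 3 levels
print(normalized_mutual_information(load, load, 10, 10))    # identical signals: 1
print(normalized_mutual_information(load, masked, 10, 3))   # masked signal: between 0 and 1
```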
The above information-theoretic measures show that when the data of individual users are processed separately, a user’s data privacy is enhanced at the cost of reduced information accuracy. We need to emphasize that measures like NI or KL divergence focus on an individual signal and do not characterize the information recovery when multiple signals are processed together. In fact, when the data of multiple users are available, and strong correlations exist among different users’ data, such correlation can be leveraged to enhance the data accuracy. As stated in problems (P1) and (P2), the major technical objective of this paper is to develop data recovery and clustering methods from quantized measurements of multiple users, where the data correlations are characterized by data points belonging to the same subspace. As we will show in Section 3 (Theorem 1 and Proposition 1), asymptotic information accuracy from quantized measurements can be achieved when the number of users increases to infinity. We need to emphasize that this result does not contradict the data privacy enhancement by adding noise and applying quantization. This is because the asymptotic information accuracy is only achieved when processing the correlated data of a large number of users, while a cyber intruder is very unlikely to have access to the data of so many users. In our proposed decentralized data collection and processing framework (Fig. 1), each agent collects the measurements of a subset of users, and the measurements are not directly shared among the agents. A cyber intruder needs to hack either all these agents or the smart meters of all users to be able to access all the data. Since such an attack is very unlikely to happen, the user’s data privacy is still protected. Privacy from the recovery perspective will be discussed in detail in Section 3.3.
Results: theoretical
Here, we consider solving (P1) at a single center and defer the discussion of solving (P2) in a decentralized way through distributed nodes to Section 4. We propose to estimate L^{∗}, C^{∗}, and E^{∗} by the solution \(\left (\hat {L},\hat {E},\hat {C}\right)\) to the following optimization problem,
where
1_{[A]} is the indicator function that takes value 1 if A is true and value 0 otherwise. ∥·∥_{0} measures the number of nonzero entries in a vector or matrix. Data recovery and subspace clustering are achieved simultaneously by solving (3)–(5).
Equations (3)–(5) form a constrained maximum log-likelihood estimation problem that maximizes the likelihood of obtaining Y_{Ω} when the underlying data matrix is \(\hat {L}\), and the error matrix is \(\hat {E}\). The formulation follows (8) of [27] by extending from full observations to partial observations in Ω. After obtaining \(\hat {C}\), spectral clustering [41] is applied to \(\hat {C}\) to obtain group labels.
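The likelihood part of the estimation problem can be made concrete with a short sketch. It assumes the probit model and evaluates the negative log-likelihood of the observed quantized entries for a candidate X=L+E; the constraints of the feasible set are not modeled, and the function names and the 1-based bin-index array `Y_idx` are hypothetical.

```python
import math
import numpy as np

def probit_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def neg_log_likelihood(X, Y_idx, mask, boundaries, sigma):
    """-sum over observed (i, j) of log phi_l(X_ij), where l = Y_idx[i, j].

    X: candidate L + E; Y_idx: 1-based bin index of each quantized entry;
    mask: 1 where the entry is in Omega (observed), 0 where it was lost.
    """
    total = 0.0
    for i, j in zip(*np.nonzero(mask)):
        l = Y_idx[i, j]
        upper = probit_cdf(boundaries[l] - X[i, j], sigma)
        lower = probit_cdf(boundaries[l - 1] - X[i, j], sigma)
        total -= math.log(upper - lower)
    return total

boundaries = [-math.inf, 1.0, 2.0, math.inf]  # illustrative K = 3 bins
Y_idx = np.full((2, 2), 2)                    # every entry fell in the middle bin
mask = np.ones((2, 2), dtype=int)             # nothing lost in this toy case
good = neg_log_likelihood(np.full((2, 2), 1.5), Y_idx, mask, boundaries, 0.5)
bad = neg_log_likelihood(np.full((2, 2), 3.0), Y_idx, mask, boundaries, 0.5)
print(good < bad)  # a candidate inside the observed bin is more likely
```

Entries outside Ω contribute nothing to the sum, which is how the formulation accommodates lost measurements.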
Equations (3)–(5) are nonconvex due to the nonconvexity of the feasible set \(\mathcal {S}_{f}\) in (5). We first analyze the recovery and clustering performance, assuming that a solution exists. We defer the algorithm to Section 4.
Data recovery guarantee
Two constants γ_{α} and L_{α} are needed for the recovery analysis,
where \(\dot {{\varphi }}_{l}(x)\) and \(\ddot {{\varphi }}_{l}(x)\) are the first- and second-order derivatives with respect to x. Note that \(\dot {{\varphi }}_{l}(x)^{2} - \ddot {{\varphi }}_{l}(x)\varphi _{l}(x) > 0\) if φ_{l} is strictly log-concave. One can check that φ_{l} is strictly log-concave if Ψ is log-concave, which holds true for Gaussian and logistic distributions [28]. L_{α} and γ_{α} are bounded by some fixed constants when α_{1}, α_{2}, and φ_{l} are given.
Since the data recovery performance and the clustering performance are coupled together, we first analyze the recovery performance, assuming that the clustering results are not “arbitrarily bad.” We follow the same assumption as [27], which essentially requires that in the estimated clustering results, every cluster contains data points belonging to at most a constant number out of the p original subspaces. Formally, we have
Assumption 1
[27]: Columns of \(\hat {L}\) belong to \(\hat {p}\) subspaces, each of which has a dimension smaller than or equal to d. Columns in \(\hat {L}\) with indices corresponding to columns of L^{∗} in S_{i} (i∈[p]) belong to at most (g−1) subspaces, where g is a constant larger than 1.
We follow the assumption in [28] about the location of the observed entries. We make a minor change to handle multiple subspaces instead of one subspace in [28]. Assumption 2 is a generalization of the uniform sampling and includes the uniform sampling as a special case. We define a binary matrix G with G_{i,j}=1 if and only if (i,j)∈Ω, i.e., Y_{i,j} is observed. G_{i,j}=0 otherwise. Let \(G_{i} \in \mathbb {R}^{m\times n_{i}}\) denote the submatrix of G with columns corresponding to subspace i.
Assumption 2
Assume each column of G_{i} has h nonzero entries. Let σ_{1}(G_{i}) and σ_{2}(G_{i}) denote the largest and the second largest singular values of G_{i}, respectively. Assume σ_{1}(G_{i})≥h and \(\sigma _{2}(G_{i}) \le \mathcal {C} \sqrt {h}\) for i∈[p], where \(\mathcal {C}\) is a positive constant.
Assumption 2 is similar to the sampling assumption in [28]. The difference is that we make the assumption on columns belonging to each subspace instead of the whole matrix. The above assumption is more general than the uniform sampling assumption [28].
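The quantities in Assumption 2 are easy to inspect for a given sampling pattern. The sketch below draws a random binary G_{i} with exactly h observed entries per column and computes its two largest singular values; the sizes and the random construction are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_i, h = 20, 60, 8          # rows, columns in subspace i, observations per column

# Each column of G_i observes h of the m time instants, chosen uniformly at random.
G_i = np.zeros((m, n_i))
for j in range(n_i):
    G_i[rng.choice(m, size=h, replace=False), j] = 1.0

singular_values = np.linalg.svd(G_i, compute_uv=False)  # sorted in descending order
sigma_1, sigma_2 = singular_values[0], singular_values[1]
ratio = sigma_2 / np.sqrt(h)   # the constant in Assumption 2 must be at least this large
print(sigma_1 >= h, ratio)
```

Because every column sums to h and n_{i}≥m, the largest singular value is at least \(h\sqrt {n_{i}/m} \geq h\), so the first condition holds for any such pattern; the second condition is what distinguishes well-spread sampling patterns.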
Theorem 1
Suppose that φ_{l}(x) is strictly log-concave in x, ∀l∈[K]. Then, under Assumptions 1 and 2, with probability at least \(1-pC_{1}e^{-C_{2}\xi n/p}\), any global minimizer \(\hat {L}\) to (3)–(5) satisfies
where
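for some positive constants C_{1}, C_{2}, \(C'_{1}(L_{\alpha },g,\xi)\), \(C'_{2}(L_{\alpha },g,\xi,\alpha _{2})\), and \(C'_{3}(L_{\alpha },g,\xi,\alpha _{2})\). \(f=\frac {|\Omega |}{mn}=\frac {h}{m}\) is the fraction of measurements that are observed, so the data loss rate is 1−f.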
Theorem 1 characterizes the recovery error from partially observed, partially corrupted, and quantized measurements. It can be interpreted from the following aspects.
(1) Correction of corrupted measurements. We first fix the data loss rate f and consider the recovery performance with corrupted measurements. Suppose f is a constant, i.e., a constant fraction of the measurements are available. Then, (8) indicates that as long as the number of corrupted measurements s is at most Θ(md^{2}p), we have^{Footnote 3}
Thus, the recovery method tolerates a constant number of corrupted measurements per column without degrading the recovery performance.
(2) Asymptotic recovery of the actual data. Since \(\mathcal {O}\left (\sqrt {\frac {d^{3}}{m}}\right)\) decreases to 0 when m increases to infinity, and ∥L^{∗}∥_{F} is in the order of \(\sqrt {mn}\), (10) indicates that the relative error between \(\hat {L}\) and L^{∗} diminishes asymptotically. Moreover, as long as p is o(n), the failure probability \(pC_{1}e^{-C_{2}\xi n/p}\) also decays to zero as n increases to infinity. The asymptotic recovery differentiates the operating center and cyber intruders. An operating center with a sufficient number of measurements can recover L^{∗} accurately. In contrast, a cyber intruder with access to a small number of users cannot recover the data even using the same approach (3)–(5).
(3) Tolerance of the missing data. To the best of our knowledge, only refs. [28] and [31] provided the theoretical analysis of low-rank matrix recovery from quantized observations with data losses. No corruptions are considered in [28,31]. The relative recovery error by [28] is \(\mathcal {O}\left (\sqrt {\frac {r^{3}}{m}}\right)\) under the partial observation case when f is a fixed constant, where r is the rank of the matrix. The relative recovery error by [31] is \(\mathcal {O}\left (\frac {r^{1/4}}{m^{1/4}}\right)\) under the partial observation case. Our result in (10) indicates that when f is a constant, the error is at most \(\mathcal {O}\left (\sqrt {\frac {d^{3}}{m}}\right)\) even with corrupted measurements. Note that the rank of L^{∗} can be as large as pd when the subspaces are all orthogonal to each other. If one directly applies the approach in [37] to our setup, the relative recovery error can be as large as \(\mathcal {O}\left (\sqrt {\frac {p^{3}d^{3}}{m}}\right)\), which is \(\sqrt {p^{3}}\) times our recovery error. Thus, our approach outperforms the existing one by recovering and clustering data simultaneously even in the special case of no corruptions.
When there is no missing data, the recovery error by [27] is \(\mathcal {O}\left (\sqrt {\frac {d}{m}}\right)\), which is slightly tighter than our error bound in (10). This is due to our techniques to handle the missing data.
Fundamental limit of any recovery method
The following theorem establishes the minimum possible error by any method from unquantized measurements. We consider the case that the number of corruptions is at most a constant fraction of the measurements. To simplify the analysis, we assume
where C_{0} is a constant smaller than 1/2. Let
Theorem 2
Let \(N \in \mathbb {R}^{m \times n}\) contain i.i.d. entries from \(\mathcal {N}\left (0, \sigma ^{2}\right)\). Assume (11) holds. Consider any algorithm that, for any \(X \in \mathcal {S}_{fX}\), takes M_{ij}=X_{ij}+N_{ij},(i,j)∈Ω as the input and returns an estimate \(\hat {X}\) of X. Then, there always exists some \(X \in \mathcal {S}_{fX}\) such that with probability at least \(\frac {3}{4}\),
holds for some fixed constants C_{3} and C_{4}, where \(C_{3} = \sqrt {\frac {1-2C_{0}}{8}}\min (\alpha _{1}, \alpha _{2})\) and \(C_{4} < \sqrt {\frac {1-2C_{0}}{256}}\). Here, s_{Ω} is the number of errors in X_{Ω}.
Note that C_{3} is a constant. When f is a constant, (13) indicates that
The recovery error from unquantized measurements is at least \(\Theta \left (\sqrt {\frac {d}{m}}\right)\). Comparing it with our error bound \(\sqrt {\frac {d^{3}}{m}}\) in (10), one can see that our method is close to optimal. If the corrupted entries are randomly distributed, s_{Ω} is approximately Θ(fs). Then, the second term inside the minimization of (13) scales as \(\Theta \left (\frac {1}{\sqrt {f}}\sqrt {\frac {d}{m}}\right)\).
Privacy from the recovery perspective
Recovery of a single user from its own data only
An intruder is often interested in the data of a certain user. If the adversary only has access to one user’s data, then problems (3)–(5) are reduced to
Note that since n=1, there is no constraint on C. Problem (15) maximizes the log-likelihood of one user given the information about the quantized measurements. It can be viewed as a special case of the low-rank matrix recovery from quantized measurements considered in [37]. One can check that the average recovery error is upper bounded by \(\mathcal {O}\left (\sqrt {d^{3}}\right)\) by setting n=1 in Theorem 5 of [37]. Similarly, the recovery error by any method is at least in the order of \(\Theta (\sqrt {d})\) by setting n=1 in Theorem 4 of [37]. This error bound does not depend on m, the number of measurements of this user. Therefore, if an intruder only has one user’s data, even if m is very large, the average recovery error is nonzero and does not diminish as m increases. Then, the privacy of the energy consumption behavior of this user is protected.
Recovery of a single user by leveraging other users in the same group
One can exploit the measurements from other users to increase the estimation accuracy of one target user. Suppose one can access n users’ data in m time steps, and these users all share similar load patterns as the target user, then from either Theorem 1 of this paper or Theorem 5 of [37], the average recovery error is at most \(\mathcal {O}\left (\sqrt {\frac {d^{3}}{\min (m, n)}}\right)\). Compared with the previous case of accessing the data of one single user only, the recovery error is significantly reduced. We emphasize that the decrease of the recovery error results from exploiting correlations among users.
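The contrast between the two cases can be seen by evaluating the stated orders of magnitude. The sketch below ignores the constants hidden in the \(\mathcal {O}(\cdot)\) notation, so the numbers indicate scaling only and are not actual error values; the function names and the sample values of d, m, and n are assumptions.

```python
import math

def single_user_rate(d):
    """Order of the upper bound with one user's data: sqrt(d^3), independent of m."""
    return math.sqrt(d ** 3)

def group_rate(d, m, n):
    """Order of the upper bound with n correlated users: sqrt(d^3 / min(m, n))."""
    return math.sqrt(d ** 3 / min(m, n))

d = 2
for size in (100, 10000):
    # With one user the rate stays flat in m; with many users it shrinks.
    print(size, single_user_rate(d), group_rate(d, m=size, n=size))
```

With d=2, increasing m and n from 100 to 10000 reduces the group rate by a factor of 10, while the single-user rate does not change, which is the gap between the operating center and an intruder holding one user's data.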
The number of quantization levels K also affects privacy. Intuitively, a smaller value of K corresponds to a higher level of privacy. However, the privacy level also depends on the selection of bin boundaries, and decreasing K does not necessarily increase privacy. For instance, if a pair of boundaries are chosen so close to each other that no measurements fall within the interval, then K=3 could reach the same privacy and recovery error as K=2. Therefore, K does not directly appear in Theorem 1 but rather affects the privacy indirectly through γ_{α} and L_{α}. The bin boundaries usually tend to be closer in the region where the measurements concentrate.
For smart meter data, the bin boundaries can be selected in the range of a typical household consumption level. If a certain house has some electrical appliances with an energy consumption level significantly higher than normal households, this abnormal pattern of high energy consumption can in fact be masked in the noisy and quantized measurements because of how the bin boundaries are selected. However, since this house has a different load pattern from other households, one cannot exploit other users’ data to enhance the recovery accuracy of this user. The recovered data of this user will have a nonzero error as discussed in the first paragraph of Section 3.3.
Clustering guarantee
The clustering performance is evaluated through the subspace-preserving property of \(\hat {C}\). A sufficient condition for \(\hat {C}\) to be subspace-preserving is stated as follows.
Proposition 1
Suppose columns of \(\hat {L}\) are i.i.d. drawn from a certain unknown continuous distribution supported on \(\hat {p}\) distinct d-dimensional subspaces, then the global minimizer \(\hat {C}\) of (3) has the subspace-preserving property for \(\hat {L}\).
Ref. [27] also provides a sufficient condition for \(\hat {C}\) to be subspace-preserving. The subspaces are required to be independent of each other in [27]. Two independent subspaces intersect only at zero. Here, we only require the subspaces to be distinct from each other. Two subspaces are distinct if for each subspace, there exists one point that belongs to this subspace but not the other. The data points are generated based on some continuous distribution supported on these distinct subspaces.
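To make the property concrete, the sketch below checks whether a coefficient matrix is subspace-preserving with respect to a given labeling, using the standard definition that \(\hat {C}_{j,i}\) may be nonzero only when points i and j lie in the same subspace. The function name, the tolerance, and the toy matrix are assumptions made for illustration and are not part of the paper.

```python
import numpy as np


def is_subspace_preserving(C, labels, tol=1e-12):
    """Return True if every nonzero entry of C links two points with the same label."""
    labels = np.asarray(labels)
    # same[j, i] is True when points j and i carry the same subspace label.
    same = labels[:, None] == labels[None, :]
    # All entries that connect different subspaces must vanish (up to tol).
    return bool(np.all(np.abs(C)[~same] <= tol))


# Toy example: four points, the first two in one subspace, the last two in another.
labels = [0, 0, 1, 1]
C_good = np.array([[0.0, 1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.5],
                   [0.0, 0.0, 2.0, 0.0]])
C_bad = C_good.copy()
C_bad[0, 3] = 0.1  # a coefficient that links the two subspaces

print(is_subspace_preserving(C_good, labels), is_subspace_preserving(C_bad, labels))
```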
Distributed sparse alternative proximal algorithm for data recovery and clustering
We next propose a distributed algorithm to solve (3) by W nodes collaboratively such that node i can estimate \(L^{*}_{\Phi _{i}}\) from its acquired measurements \(Y_{\Phi _{i}}\), while it does not know \(Y_{\Phi _{j}}\) or \(L^{*}_{\Phi _{j}}\) of any other node j. This further enhances user privacy.
We first follow [27] and move some constraints to the objective function to simplify the algorithm design. Since the rank of L is at most r, we factorize L as L=UV^{T}, where \(V \in \mathbb {R}^{n \times r}\) and \(U \in \mathbb {R}^{m \times r}\). We replace the equality constraints L=LC and L=UV^{T} by adding \(\frac {\lambda _{1}}{2}\left \|V^{T}-V^{T}C\right \|_{F}^{2}\) and \(\frac {\lambda _{2}}{2}\left \|UV^{T}-L\right \|_{F}^{2}\) to the objective function. The parameters λ_{1} and λ_{2} affect the tightness of the original constraints. Note that V^{T}=V^{T}C is a sufficient but not necessary condition for L=LC. Then, (3) is changed into
where
The solution of (16) is the same as that of (3) when λ_{1} and λ_{2} approach infinity.
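The two penalty terms that replace the equality constraints can be evaluated directly. The following sketch, with a hypothetical function name and toy data, computes the squared Frobenius-norm penalties on V^{T}−V^{T}C and UV^{T}−L, and confirms that both vanish when the original constraints hold exactly. It covers only these two terms, not the full objective.

```python
import numpy as np


def penalty_terms(U, V, L, C, lam1, lam2):
    """Return (lam1/2)*||V^T - V^T C||_F^2 and (lam2/2)*||U V^T - L||_F^2."""
    Vt = V.T
    self_expression = 0.5 * lam1 * np.linalg.norm(Vt - Vt @ C, "fro") ** 2
    factorization = 0.5 * lam2 * np.linalg.norm(U @ Vt - L, "fro") ** 2
    return self_expression, factorization


# Toy data: rows of V come in identical pairs, so each column of V^T is
# reproduced by its twin and the constraint V^T = V^T C holds exactly.
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=2), rng.normal(size=2)
V = np.vstack([v1, v1, v2, v2])      # n = 4, r = 2
U = rng.normal(size=(6, 2))          # m = 6
L = U @ V.T                          # exact factorization
C = np.zeros((4, 4))
C[1, 0] = C[0, 1] = 1.0              # columns 0 and 1 represent each other
C[3, 2] = C[2, 3] = 1.0              # columns 2 and 3 represent each other

p1, p2 = penalty_terms(U, V, L, C, lam1=0.5, lam2=0.5)
print(p1, p2)
```

Perturbing L or C makes the corresponding term positive, and larger λ_{1} and λ_{2} penalize such violations more heavily.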
We next decompose V into W parts, and let \(V_{\Phi _{i} \star } \in \mathbb {R}^{q \times r}, i \in [W]\) denote the rows of V with row indices Φ_{i}. Then, the objective in (17) can be decomposed as follows:
where
and V contains \(V_{\Phi _{1} \star }\) to \(V_{\Phi _{W} \star }\), i.e., \(V = \left [\begin {array}{c} V_{\Phi _{1} \star } \\ V_{\Phi _{2} \star } \\ \vdots \\ V_{\Phi _{W} \star }\end {array}\right ]\). The constraint set \(\mathcal {S}\mathcal {F}\) in (18) is equivalent to the intersection of \({\mathcal {S}\mathcal {F}}_{i}\)’s (∀j∈[q]), with^{Footnote 4}
Then, (16) can be equivalently written as
where the estimated variables are U and W components of L,E,C, and V.
The constraints in (23) can be decomposed for W nodes, while the objective function cannot, due to the coupling of U and V. Here, we develop a synchronized Distributed Sparse Alternative Proximal Algorithm (DSAPA) to solve (23) with a convergence guarantee. Node i owns \(Y_{\Phi _{i}}\) and estimates \(V_{\Phi _{i} \star }\), \(L_{\Phi _{i}}\), \(E_{\Phi _{i}}\), \(C_{\Phi _{i}}\), and U. Since all nodes have the estimates of U, and \(L_{\Phi _{i}}=UV_{\Phi _{i} \star }^{T}\), the key to protecting the user privacy of node i is to share neither the estimate of \(V_{\Phi _{i} \star }\) nor \(Y_{\Phi _{i}}\) with any other node.
In the (t+1)th iteration, node i sequentially updates \(C_{\Phi _{i}}^{t+1}\), \(V_{\Phi _{i} \star }^{t+1}\), \(L_{\Phi _{i}}^{t+1}\), \(E_{\Phi _{i}}^{t+1}\), U^{t+1} in Subroutines 1–5. Each subroutine essentially follows the projected gradient method. The gradients of H with respect to \(V_{\Phi _{i} \star }\), \(L_{\Phi _{i}}\), \(E_{\Phi _{i}}\), U, and \(C_{\Phi _{i}}\) are
where M=V^{T}−V^{T}C, \(\phantom {\dot {i}\!}\iota _{i} = V_{\Phi _{i} \star }^{T}V_{\Phi _{i} \star }\), \(\phantom {\dot {i}\!}\zeta _{i} = L_{\Phi _{i}}V_{\Phi _{i} \star }\), and
The step sizes in the (t+1)th iteration are selected as
and
where e_{U} = λ_{2}∥(U^{t})^{T}U^{t}∥_{F}, \(\digamma _{i}^{t} = \left \|I_{q \times q}+\left (C_{\Phi _{i} \star }^{t+1}\right)\cdot \left (C_{\Phi _{i} \star }^{t+1}\right)^{T}-\left (C_{\Phi _{i}}\right)_{\Phi _{i} \star }^{t+1}-\left ((C_{\Phi _{i}})_{\Phi _{i} \star }^{t+1}\right)^{T}\right \|_{F}\). These step sizes are no greater than the reciprocals of the smallest Lipschitz constants of \(\nabla _{C_{\Phi _{i}}} H\), \(\nabla _{V_{\Phi _{i} \star }} H\), \(\nabla _{L_{\Phi _{i}}} H\), \(\nabla _{E_{\Phi _{i}}} H\), and ∇_{U}H in the tth iteration, respectively. Details of the calculations are shown in Appendix 6. This property is useful for the convergence analysis of the DSAPA.
The constraints in (22) are met by projecting the updated estimates to \({\mathcal {S}\mathcal {F}}_{i}\). For the constraints on \(C_{\Phi _{i}}\), in steps 10–15 of Subroutine 1, we first set diagonal entries of \((C_{\Phi _{i}})_{\Phi _{i} \star }^{t+1}\) to zero. Then, for every j∈[q], we keep the d entries of \((C_{\Phi _{i}})_{\star j}^{t+1}\) with the largest absolute values and set all other entries to zero. The infinity-norm constraint on \(L_{\Phi _{i}}\) is met by setting all entries larger than α_{1} to α_{1} and all entries smaller than −α_{1} to −α_{1} (step 4 in Subroutine 3). A similar approach applies to \(E_{\Phi _{i}}\) with the bound α_{2}. For \(E_{\Phi _{i}}\), we also keep the \(\frac {s}{W}\) entries with the largest absolute values and set the other nonzero entries to zero (steps 3–6 in Subroutine 4).
Note that \(L_{\Phi _{i}}\) and \(E_{\Phi _{i}}\) can be updated by node i independently and are not shared with other nodes. Updating \(C_{\Phi _{i}}\), \(V_{\Phi _{i}}\), and U needs communication from other nodes due to the coupling in the objective function. \(V_{\Phi _{i}}\) cannot be shared with other nodes, since otherwise other nodes can estimate \(L_{\Phi _{i}}\) by multiplying U and \(V_{\Phi _{i}}\). Thus, node i computes the intermediate terms that depend on \(V_{\Phi _{i}}\) and sends them to other nodes instead of sending \(V_{\Phi _{i}}\), as illustrated in Fig. 3.
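The projection steps described above can be sketched as follows. This is a minimal illustration, not the paper's Subroutines 1, 3, and 4: the function names are hypothetical, the rows owned by node i are assumed to be contiguous and to start at `row_offset`, and ties between equal magnitudes are broken arbitrarily.

```python
import numpy as np


def project_C(C_block, row_offset, d):
    """Zero the diagonal of the node's own q x q sub-block, then keep the d
    largest-magnitude entries in each column of the n x q block."""
    C = C_block.copy()
    n, q = C.shape
    cols = np.arange(q)
    C[row_offset + cols, cols] = 0.0            # no point represents itself
    if d < n:
        order = np.argsort(-np.abs(C), axis=0)  # rows sorted by descending magnitude
        C[order[d:, :], cols] = 0.0             # drop everything beyond the top d
    return C


def project_L(L_block, alpha1):
    """Clip entries to [-alpha1, alpha1] to satisfy the infinity-norm bound."""
    return np.clip(L_block, -alpha1, alpha1)


def project_E(E_block, alpha2, k):
    """Clip entries to [-alpha2, alpha2], then keep the k largest-magnitude entries."""
    E = np.clip(E_block, -alpha2, alpha2)
    if k <= 0:
        return np.zeros_like(E)
    flat = np.abs(E).ravel()
    if k < flat.size:
        keep = np.argpartition(flat, flat.size - k)[flat.size - k:]
        mask = np.zeros(flat.size, dtype=bool)
        mask[keep] = True
        E = np.where(mask.reshape(E.shape), E, 0.0)
    return E


C_proj = project_C(np.array([[0.9, 0.1], [0.2, 0.8], [0.5, -0.7], [-0.3, 0.05]]),
                   row_offset=0, d=1)
L_proj = project_L(np.array([[7.0, -8.0], [1.0, 2.0]]), alpha1=6.0)
E_proj = project_E(np.array([[5.0, -0.1], [0.2, -3.0]]), alpha2=4.0, k=2)
```

In the toy call, each column of `C_proj` retains a single off-diagonal coefficient, `L_proj` is clipped to ±6, and `E_proj` keeps only its two largest entries after clipping to ±4.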
The algorithm is initialized as follows. \(L_{\Phi _{i}}^{0}\) in node i is defined as,
Then, node i performs the truncated singular value decomposition on \(L^{0}_{\Phi _{i}}\) and lets \(U_{i}^{(r)}\Sigma _{i}^{(r)} (V_{\Phi _{i} \star }^{(r)})^{T}\) denote the rank-r approximation to \(L^{0}_{\Phi _{i}}\). Then, node i transmits \(U_{i}^{(r)}\) to all other nodes. Each node initializes at
The convergence of DSAPA is summarized as follows.
Theorem 3
From any initial point, DSAPA always converges to a critical point of (23).
The computational complexities of Subroutines 1–5 are \(\mathcal {O}(nqr)\), \(\mathcal {O}(mqr)\), \(\mathcal {O}(mq)\), \(\mathcal {O}(mq)\), and \(\mathcal {O}(mqr)\), respectively. The per-node per-iteration complexity of DSAPA is \(\mathcal {O}(nqr)\). In contrast, the complexity of the centralized algorithm in [27] is \(\mathcal {O}(nmr)\). The communication costs of Subroutines 1, 2, and 5 are \(\mathcal {O}(n^{2})\), \(\mathcal {O}(nWr)\), and \(\mathcal {O}(mWr)\), respectively.
For data clustering, a central node collects \(\hat {C}_{\Phi _{i}}\) from all the nodes and applies spectral clustering [41] to obtain the clustering results.
When λ_{1} and λ_{2} are large enough, (23) approximates (3), but the step sizes in (30)–(32) and (34) are small, which reduces the convergence rate. One practical solution is to dynamically increase λ_{1} and λ_{2} [55]. We suggest the following practical selection. Initialize with small λ_{1} and λ_{2}, and replace λ_{2} with ρλ_{2} (ρ>1) in each of the first T_{0} iterations. Then, reset λ_{2} to its initial value and replace λ_{1} and λ_{2} with ρλ_{1} and ρλ_{2} simultaneously in each subsequent iteration. The algorithm terminates after T iterations.
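One reading of this schedule is sketched below. The function name is hypothetical, and the default values (initial value 0.5, ρ=1.05, T=200, T_{0}=40) are those used in the experiments of the next section. Whether the reset happens exactly at iteration T_{0} is an assumption of this sketch.

```python
def lambda_schedule(lam_init=0.5, rho=1.05, T=200, T0=40):
    """Return the list of (lambda_1, lambda_2) values used in iterations 0..T-1."""
    lam1, lam2 = lam_init, lam_init
    schedule = []
    for t in range(T):
        if t == T0:
            lam2 = lam_init          # reset lambda_2 after the warm-up phase
        schedule.append((lam1, lam2))
        if t < T0:
            lam2 *= rho              # warm-up: only lambda_2 grows
        else:
            lam1 *= rho              # afterwards: both grow together
            lam2 *= rho
    return schedule


schedule = lambda_schedule()
print(schedule[0], schedule[39], schedule[40], schedule[-1])
```

Under this reading, λ_{2} grows geometrically during the first T_{0} iterations while λ_{1} stays fixed, and both parameters share the same value from iteration T_{0} onward.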
Results: numerical experiments
We evaluate the performance on the Irish smart meter dataset (ISMD) [56] and the UMass Smart* microgrid dataset (USMD) [57]. The ISMD consists of more than 5000 residential customers. The measurements are obtained every 30 min and are in kilowatts (kW). The USMD contains 443 users over 24 h, and the power consumption is measured every minute. Some users have long sequences of zero power consumption, and some users have significantly high power consumption occasionally. We suspect these measurements have data quality issues resulting from devices or communication and remove these users from the datasets. We use 4780 customers in 30 days for ISMD and 438 customers in 6 h for USMD. Thus, the size of the data matrix L is 1440×4780 for ISMD and 360×438 for USMD. The power consumption is at most 6 kW and 99 kW, respectively. Since the raw measurements are noisy, L is approximated by a rank-r matrix \(L^{*}_{\textrm {rank-}r}\) by keeping only the largest r singular values. The recovery error is measured by \(\|L^{*}_{\textrm {rank-}r}-\tilde {L}\|_{F}^{2}/\|L^{*}_{\textrm {rank-}r}\|_{F}^{2}\), where \(\tilde {L}\) is the recovered matrix. We choose r to be about 10% of the total number of the singular values. Then, r is set to 150 for ISMD and 40 for USMD. The following experiments are tested on ISMD, if not otherwise specified.
As described in Section 2.3, normalized mutual information is used to measure the data privacy. We now calculate the average normalized mutual information of 4780 users \(\hat {NI} = \frac {1}{4780}\sum _{i=1}^{4780}NI(L_{\star i}, Y_{\star i})\). As a comparison, we also calculate the normalized mutual information between the noisy data (before quantization) and the actual data. The quantization level K is chosen as 2 or 5. The quantization boundaries and quantized values are summarized in Table 2 (K=2,5). We place more boundaries in the region where data concentrate. Selecting the optimal quantization boundaries is beyond the scope of this paper and is left for future work. We believe these parameters can be optimized if a small portion of ground-truth data are available for training. The noise level σ varies from 0.1 to 0.4 with a step size of 0.02. To compute the probability distribution of L_{⋆i}, we divide the range 0–6 kW into 100 or 300 equal intervals and compute the empirical distributions. As shown in Fig. 4, the normalized mutual information between the power after quantization and the actual power consumption is always smaller than that between the noisy value before quantization and the actual power consumption. This indicates the proposed quantization process enhances the data privacy. In addition, the normalized mutual information \(\hat {NI}\) decreases when either K decreases or σ increases. That is consistent with the intuition.
Since no ground-truth clustering result exists for this dataset, we define an index CI to evaluate the clustering performance. Let a_{j} denote the maximum angle of all the data points in group j to the estimated subspace of this group. Let b_{j} denote the minimum angle of any point in group j to the other subspaces. The clustering index CI measures the clustering accuracy and is defined as
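The rank-r reference matrix and the relative recovery error can be computed with a truncated singular value decomposition, as in the following sketch. The function names and the random test matrix are illustrative assumptions. The error is the squared Frobenius norm of the difference between the reference and the recovered matrix, divided by the squared Frobenius norm of the reference.

```python
import numpy as np


def rank_r_approx(L, r):
    """Best rank-r approximation of L, keeping only the r largest singular values."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]


def relative_recovery_error(L_ref, L_rec):
    """||L_ref - L_rec||_F^2 / ||L_ref||_F^2."""
    return (np.linalg.norm(L_ref - L_rec, "fro") ** 2
            / np.linalg.norm(L_ref, "fro") ** 2)


# Small synthetic stand-in for the data matrix; r is about 10% of min(m, n).
rng = np.random.default_rng(0)
L = rng.random((60, 40))
L_r = rank_r_approx(L, r=4)
err_self = relative_recovery_error(L_r, L_r)
err_zero = relative_recovery_error(L_r, np.zeros_like(L_r))
print(err_self, err_zero)
```

A perfect recovery gives an error of 0, and returning the all-zero matrix gives an error of 1, which provides a scale for the values reported in the experiments.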
CI is large if a_{j}’s are small and b_{j}’s are large, which means that points in the same group are close to the subspace of that group and away from other groups. A larger CI corresponds to a better clustering result. We apply Sparse Subspace Clustering (SSC) [21] to this dataset with different cluster numbers and compare the resulting CI’s. We use the Alternating Direction Method of Multipliers (ADMM) [58] to solve SSC. When the number of clusters is p=4, we obtain the maximum CI=0.085. Thus, we set the number of clusters to be 4 in the following experiments.
We generate corruptions E^{∗} and noise N randomly. The nonzero entries of E^{∗} are selected from [−4,−0.5] and [0.5,4] uniformly. Every entry of N is drawn from \(\mathcal {N}(0,0.3^{2})\). The quantization level K is set to 5. The locations of the missing data are selected randomly. The simulations run in MATLAB on a computer with 3.4 GHz Intel Core i7.
We evaluate DSAPA on the quantized measurements. We choose W=5 agents. We assume the upper bounds of the magnitudes of the sparse error and the power consumption are known. For simplicity, we use the largest value of the given error and set α_{2}=4. Similarly, we set α_{1}=6. We set d=50. λ_{1} and λ_{2} are initialized to be 0.5, and ρ=1.05. The maximum iteration number T is set to be 200. T_{0} is set to be 40.
Here, d is selected to be approximately r/(p−1). We use p−1 considering the overlap between subspaces. We remark that varying d around the selected value does not affect the result. λ_{1} and λ_{2} are self-adjusted in our algorithm as discussed in the last paragraph of Section 4.
Figure 5 shows the energy consumption of a single user in 24 h. It compares the actual data, the rank-150 approximation of the actual data, the quantized observations, the recovered data by DSAPA, and the average quantized data of the users in the same group. One can see that the rank-150 approximation of the actual data has a similar pattern to the actual data. Clearly, the details of power consumption are hidden in the quantized measurements. For instance, the two peak consumptions are no longer visible in quantized measurements. Thus, an intruder cannot learn the user’s consumption pattern from the quantized measurements of that user alone. On the other hand, DSAPA recovers the power consumption trend accurately from the quantized data. The two peak loads are accurately identified in the recovered data as shown in Fig. 5. The recovered data can be used for grid planning.
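A minimal sketch of how such synthetic corruptions, noise, and missing-data locations could be generated is given below. The experiments in the paper were run in MATLAB; this NumPy version is only an illustration. The function name, the matrix size, the equal probability of positive and negative corruptions, and the 5% corruption and 15% missing rates (taken from the experiments reported below) are assumptions. Quantization is omitted because it depends on the bin boundaries in Table 2.

```python
import numpy as np


def make_corruptions(shape, rate, low, high, rng):
    """Sparse matrix whose nonzero entries have magnitudes uniform in [low, high]
    and random signs; `rate` is the fraction of corrupted entries."""
    m, n = shape
    s = int(round(rate * m * n))
    E = np.zeros(m * n)
    support = rng.choice(m * n, size=s, replace=False)  # corrupted locations
    magnitudes = rng.uniform(low, high, size=s)
    signs = rng.choice([-1.0, 1.0], size=s)             # assumed equally likely
    E[support] = signs * magnitudes
    return E.reshape(shape)


rng = np.random.default_rng(0)
shape = (48, 100)                                       # toy size, not the ISMD size
E = make_corruptions(shape, rate=0.05, low=0.5, high=4.0, rng=rng)
N = rng.normal(0.0, 0.3, size=shape)                    # Gaussian noise, sigma = 0.3
observed = rng.random(shape) >= 0.15                    # True where an entry is observed
print(np.count_nonzero(E), observed.mean())
```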
After obtaining \(\hat {C}\) using DSAPA, we apply spectral clustering [41] to cluster the data points. To visualize the recovered consumption pattern of users in each group, we normalize the power consumptions and compute the average of users in the same group. Figure 6 shows the average profile obtained by our method in 1 day (no missing data and with 15% missing data). For comparison, the mean daily profile of the ground-truth data clustered by SSC is also shown in Fig. 6. One can see that the data losses do not affect the recovery performance of DSAPA. The recovered patterns are close to the actual patterns obtained by SSC, considering that the measurements are highly noisy and quantized. Now we pick some users in the same group and average the quantized value (K=5) of these users. We calculate the normalized mutual information between one user and the averaged quantized value of the selected users. Figure 7 shows the normalized mutual information when the number of selected users varies. The value does not decrease much when the number of users increases. Compared with Fig. 4b, one can see that the averaged quantized value of the same group does not provide much information about the single user.
We compare DSAPA with Approximate Projected Gradient Method (APGM) [28] and Quantized Robust Principal Component Analysis (QRPCA) [35] for data recovery in Fig. 8a. We apply SSC on the recovered data by APGM (or QRPCA) to obtain the clustering result, labeled by “APGM + SSC” (“QRPCA + SSC”) in Fig. 8b. If we simply use the quantized value Q_{1},Q_{2},⋯,Q_{5} to estimate the actual power consumption, the relative recovery error is 0.869, which is much larger than the results in Fig. 8a. When the missing data rate changes from 0 to 0.4, our method always outperforms the other methods in both data recovery and data clustering. For comparison, CI=0.085 for SSC on the ground-truth data, and CI=0.05 for a random clustering. Our method achieves CI=0.08 using quantized measurements with 5% corruptions and no data losses.
We vary the number of users by randomly selecting a subset of the 4780 users. Under the 15% missing rate and no corruption, Fig. 9 shows the recovery error when the number of users varies. The recovery error is 0.35 when the number of users is 500 and decreases to 0.2 when there are 2500 users.
We test the case when no additional noise is added before quantization. We vary the estimated noise level when implementing DSAPA since the measurements usually contain observation noise. As shown in Fig. 10, DSAPA can recover the data with no additional noise. However, adding no noise can lead to a low privacy level. The normalized mutual information when K=2 and K=5 are 0.2862 and 0.9579, respectively (0.02 kW per interval). These values are much higher than those shown in Fig. 4, indicating a lower level of privacy when no noise is added.
In Fig. 11, we compare the relative recovery error and the clustering index CI of DSAPA and the centralized algorithm SparseAPA in [27]. Since SparseAPA does not consider missing data, we study the case with full observations. The corruption rate is set as s/(mn)=5%. The recovery error of SparseAPA is smaller than that of our method when the algorithm initializes, because SparseAPA can compute a better initialization in a centralized fashion. However, the difference decreases as the iteration number increases. After 200 iterations, both algorithms perform similarly.
We next show the performance of DSAPA on USMD. Since the measurements vary from 0 to 100 kW, we set K=7, and α_{1}=50. The quantization boundaries and quantized values are in Table 2 (K=7). p and d are set to be 4 and 15, respectively, using the same technique as discussed in the previous experiments. We generate the corruptions E^{∗} and the noise N randomly. The nonzero entries of E^{∗} are selected from [−10,10] uniformly, and the corruption rate is 5%. Every entry of N is drawn from \(\mathcal {N}(0,0.3^{2})\). Similar to Figs. 5 and 8a, we show the results on USMD in Fig. 12.
Conclusion and discussions
This paper shows for the first time that the two seemingly contradictory objectives of data privacy and information accuracy of smart meter data can be achieved simultaneously. The central technical contribution is the development of a decentralized data recovery and clustering method from highly quantized, partially lost, and partially corrupted measurements. Distributed nodes do not share raw data with each other and cannot estimate the actual data of other nodes. We propose a Distributed Sparse Alternative Proximal Algorithm (DSAPA) with a convergence guarantee to solve the nonconvex problem. The recovery error of our method is nearly optimal. The method is evaluated on actual smart meter datasets. Future work includes leveraging the time correlation within each user to further improve the method and developing unsynchronized decentralized data recovery algorithms.
Appendix 1
Supporting lemmas used in the Proof of Theorem 1
Lemma 1
Under Assumptions 1 and 2, the following inequalities hold
where \(a = \frac {\sqrt {\xi mn}}{h\sqrt {p}}\), \(b = \frac {\sqrt {\xi gdmn}\mathcal {C}}{\sqrt {hp}}\). Ω_{i} includes the indices of the observed entries.
Proof
From Assumptions 1 and 2,
where (a) holds from Lemma 8 and Lemma 9 in [28], and the assumption n_{i}≤ξn/p. (b) holds because of σ_{1}(G_{i})≥h and \(\sigma _{2}(G_{i}) \le \mathcal {C} \sqrt {h}\).
Then
where (c) follows from (40) (or (42)). (d) holds from \(\sum _{i=1}^{p}\left \|\left (\hat {L}_{i}-L^{*}_{i}\right)_{\Omega _{i}}\right \|_{F} \le \sqrt {p}\left \|\left (\hat {L}-{L}^{*}\right)_{\Omega }\right \|_{F}\). Then, we have the desired result. □
Lemma 2
Let \(\hat {\theta }=\text {vec}(\hat {X})\), θ^{∗}=vec(X^{∗}), \(\mathcal {F}(\theta ^{*}) = F(\theta ^{*})\), and \(\hat {X}\), \(X^{*} \in \mathcal {S}_{fX}\). Assume the same conditions as in Theorem 1. Then, with probability at least \(1-pC_{1}e^{-C_{2}\xi n/p}\),
holds for the positive constants C_{1} and C_{2}. 〈.,.〉 denotes the inner product of two matrices, i.e., the sum of entrywise products.
Proof
The proof is generalized from the proof of Lemma 2 in [27] which does not consider missing data. Here we extend the analysis to handle missing data. According to the definition, there exists a permutation matrix Γ^{∗} such that L^{∗} can be written as \(L^{*}=\left [L^{*}_{1},L^{*}_{2},...,L^{*}_{p}\right ]\Gamma ^{*}\). By Assumption 2, \(\hat {L}\) can be written as \(\hat {L}=\left [\hat {L}_{1},\hat {L}_{2},...,\hat {L}_{p}\right ]\Gamma ^{*}\), where the dimension of \(\hat {L}_{i}\) is smaller than or equal to (g−1)d.
Note that \(\sum _{l=1}^{K}\varphi _{l}\left (\left (X^{*}_{i}\right)_{k,j}\right)=1\) and \(\left [L^{-1}_{\alpha }\nabla _{X}F\left (X^{*}_{i}\right)\right ]_{k,j} = -L^{-1}_{\alpha }\sum _{l=1}^{K}\frac {\dot {\varphi }_{l}\left (\left (X^{*}_{i}\right)_{k,j}\right)}{\varphi _{l}\left (\left (X^{*}_{i}\right)_{k,j}\right)}\cdot \boldsymbol {1}_{[(Y_{i})_{k,j}=l]}\). Combining them with (7), one can conclude that the elements of \(L^{-1}_{\alpha }\nabla _{X}F\left (X^{*}_{i}\right)\) have zero mean, and the variances are bounded by one. Using the result of Lemma 1 in [27], we have
holds with probability at least \(1-C_{1}e^{-C_{2}\xi n/p}\). \(X^{*}_{i}\) is the same ith group as \(L^{*}_{i}\) under the permutation Γ^{∗}. Then
holds with probability at least \(1-pC_{1}e^{-C_{2}\xi n/p}\).
(a) holds from the linearity of the inner product. The first term of (b) holds from 〈A,B〉≤∥A∥_{2}∥B∥_{∗}. The second term of (b) holds from the fact that both \(\hat {E}, E^{*}\) have at most s nonzero entries and ∇_{X}F(X^{∗})_{i,j}≤1. (c) holds from (45) and the fact \(\left \|\hat {L}_{i}-L^{*}_{i}\right \|_{*} \le \sqrt {2gd}\left \|\hat {L}_{i}-L^{*}_{i}\right \|_{F}\). (d) holds from (40). (e) holds from \(\sum _{i=1}^{p}\left \|\left (\hat {L}_{i}-L^{*}_{i}\right)_{\Omega _{i}}\right \|_{F} \le \sqrt {p}\|(\hat {L}-{L}^{*})_{\Omega }\|_{F}\). (f) holds because \(\|(\hat {E}-E^{*})_{\Omega }\|_{F} \le 2\alpha _{2}\sqrt {s}\), which results from the fact that \(|\hat {E}_{i,j}-E^{*}_{i,j}|\) is bounded by 2α_{2}. The probability \(1-pC_{1}e^{-C_{2}\xi n/p}\) comes from the union bound for \(P(\max _{i \in [p]}\|\nabla _{X}F(X^{*}_{i})\|_{2} \le 2.01L_{\alpha }\sqrt {\xi n/p})\). □
Appendix 2
Proof of Theorem 1
Proof
The proof follows and extends the proofs of Theorem 1 in [28] and Theorem 5 in [37]. We extend from the low-rank matrices in [28,37] to matrices with columns in p low-dimensional subspaces. Moreover, ref. [28] does not consider corruptions, and ref. [37] does not consider missing data. Here we consider both missing data and corruptions.
The first bound \(2\alpha _{1}+2\alpha _{2}\sqrt {\frac {s}{mn}}\) in (8) follows from the fact that \(\hat {L}\), \(L^{*},\hat {E}\), \(E^{*} \in \mathcal {S}_{f}\). We discuss the second bound in (8) as follows. We denote (4) to be F(X) when we treat X to be the variable. Note that \(\mathcal {S}_{fX}\) is a compact set, and the objective function is continuous in X. F(X) then achieves a minimum in \(\mathcal {S}_{fX}\). Suppose that \(\hat {X} \in \mathcal {S}_{fX}\) minimizes F(X).
Let \(\theta =\text {vec} (X) \in \mathbb {R}^{mn}\) and \(\mathcal {F}_{\Omega,Y}(\theta)=F(X)\). By the second-order Taylor’s theorem, we have
where \(\tilde {\theta }=\theta ^{*}+\bar {\eta }(\theta -\theta ^{*})\) for some \(\bar {\eta }\in [0,1]\), with corresponding matrices \(\tilde {X}=X^{*}+\bar {\eta }(X-X^{*})\).
From (46), Lemma 2, and Lemma A.3 in [38], we have
holds with probability at least \(1-pC_{1}e^{-C_{2}\xi n/p}\) where \(c_{f} = 4.02L_{\alpha }gda\sqrt {\xi n}\), \(\eta = 8.04L_{\alpha }gda\sqrt {\xi n}\alpha _{2} \sqrt {s}+ 8.04L_{\alpha } gd\sqrt {\xi np}\alpha _{1} b + 2\alpha _{2} s L_{\alpha } \).
By solving (47), we then have
Thus,
where M_{1}–M_{5} are constants. (a) holds because of (41). (b) holds according to (48). (c) holds because of the Cauchy-Schwarz inequality. (d) holds because f=h/m, \(\frac {M_{2}d^{\frac {5}{4}}n^{\frac {1}{2}}m^{\frac {1}{4}}}{h^{\frac {5}{4}}p^{\frac {1}{2}}} =\frac {M_{2}\kappa d^{\frac {5}{4}}}{f^{\frac {5}{4}}m^{\frac {1}{2}}}\), and \(\frac {M_{5} d }{h^{\frac {1}{2}}}=\frac {M_{5} d }{f^{\frac {1}{2}}m^{\frac {1}{2}}}\). The orders of both terms are smaller than \(O\left (\frac {\kappa d\sqrt {d}}{f^{2} \sqrt {m}}\right)\). □
Appendix 3
Supporting lemmas for Theorem 2
Lemma 3
There exists a set \(\mathcal {X}\subset \mathcal {S}_{fX}\) with
such that the following properties hold for any γ∈(0,1]:
1. For all \(X\in \mathcal {X}\), X_{i,j}=±αγ or 0, ∀(i,j), where α= min(α_{1},α_{2}).
2. For all X^{(i)}, \(X^{(j)}\in \mathcal {X}\), i≠j,
Proof
Now we independently generate a set \(\mathcal {X}\) of \(\left \lceil \exp \left (\frac {dn - d \lfloor \frac {s}{m} \rfloor }{16}\right) \right \rceil \) random matrices from the following distribution. According to the column indices, X is first divided into X_{1},X_{2},⋯,X_{p}, which correspond to indices \(\{1,...,\lfloor \frac {n}{p} \rfloor \}, \{\lfloor \frac {n}{p} \rfloor +1,...,2\lfloor \frac {n}{p} \rfloor \}, \{2\lfloor \frac {n}{p} \rfloor +1,...,3\lfloor \frac {n}{p} \rfloor \},\cdots, \{(p-1)\lfloor \frac {n}{p} \rfloor +1,...,n\}\), respectively. For the first d rows of X_{1}, fix the locations of \(\lfloor \frac {s}{pm} \rfloor \) entries in each row and set the values to zero. The remaining \(d\lfloor \frac {n}{p} \rfloor - d\lfloor \frac {s}{pm} \rfloor \) entries take values ±αγ with equal probabilities. For all i∈{d+1,...,m}, \(j \in [\lfloor \frac {n}{p} \rfloor ]\),
The same process is applied to X_{2},X_{3},⋯,X_{p}. Then, one can see that X can be written as X=L+E, where L can span subspaces with dimension smaller than or equal to d, and E is a sparse matrix. We further have
Each column of L can be represented by at most d other columns. Thus, \(\mathcal {X}\subset \mathcal {S}_{fX}\).
Note that the locations of the zero entries are the same for all matrices drawn from the above distribution. Considering two different matrices X and \(\hat {X}\) drawn as above, we have
where δ_{i}’s are independent 0/1 Bernoulli random variables and the means are all \(\frac {1}{2}\). Following the same proof technique of Lemma 4 in [37], one can show that \(\mathcal {X}\) satisfies the property 2. □
Let Y=X+N, where the entries in matrix N are i.i.d. and generated from Gaussian distribution \(\mathcal {N}(0,\sigma ^{2})\). Suppose that \(X \in \mathcal {X}\) is chosen uniformly at random. Lemma 4 bounds the mutual information I(X_{Ω},Y_{Ω}).
Lemma 4
Proof
The proof is similar to the proof of Lemma 5 in [31], but [31] does not consider corruptions. We modify the proof to handle corruptions. From Lemma 5 in [31], one can obtain
where ℵ denotes a matrix whose entries are generated i.i.d. from {+1,−1}. \(\tilde {X}=X\cdot \aleph \) denotes the entrywise product of X and ℵ.
The vectorization of \(\tilde {X}_{\Omega }+N_{\Omega }\) is denoted by \(\text {vec}(\tilde {X}_{\Omega }+N_{\Omega }) \in \mathbb {R}^{|\Omega |}\). We compute the covariance matrix as
Then, by Theorem 8.6.5 in [59], we have
The equality holds since \(\tilde {X}\) has s_{Ω} zero entries.
We have \(H(N_{\Omega }) = \frac {1}{2}\log ((2\pi e)^{|\Omega |}\sigma ^{2|\Omega |})\) and thus
which establishes the lemma. □
Appendix 4
Proof of Theorem 2
Proof
The proof follows Theorem 4 in [31] which does not consider the corruptions. Our proof is more involved due to the corruptions. Choose ε so that
where C_{4} is a constant to be determined later. The set \(\mathcal {X}\) is defined in Lemma 3. γ is set to be
Suppose for the sake of a contradiction that there exists an efficient algorithm such that for any \(X \in \mathcal {S}_{fX}\), given the measurements Y, returns an \(\hat {X}\), and
holds with probability at least 1/4. Let
Following the proof of Theorem 4 in [31], one can find that if (62) holds, then X^{∗}=X. By the assumption of (62),
Let X be a matrix chosen uniformly at random from \(\mathcal {X}\). Consider running the algorithm on X. By Fano’s inequality, the probability that X≠X^{∗} is at least
We have obtained \(\mathcal {X}\) from Lemma 3 and I(X_{Ω},Y_{Ω}) from Lemma 4. Then, using the inequality log(1+z)≤z, we obtain
Combining (66) with (61) and (64), we obtain
which implies that
Setting \(C_{4}^{2} < \frac {12C_{0}}{256}\) leads to a contradiction, hence (62) must fail to hold with probability at least 3/4. Using the definition \(f= \frac {|\Omega |}{mn}\), we obtain the desired result. □
Appendix 5
Proof of Proposition 1
Proof
Given any i, from (5), we know that \(\hat {L}_{\star i}=\hat {L}\hat {C}_{\star i}\). Without loss of generality, we assume \(\hat {L}_{\star i} \in \hat {S}_{1}\), where the \(\hat {p}\) subspaces are denoted by \(\hat {S}_{i}\) (\(i\in [\hat {p}]\)). Then, from the constraint \(\hat {C}_{i,i}=0, \forall i \in [n]\), we have \(\hat {L}_{\star i}=[\hat {L}_{1\backslash \star i} ~~\hat {L}_{-1}]\left [\begin {array}{c} \hat {C}_{\star i}^{(1\backslash \star i)} \\\hat {C}_{\star i}^{(-1)}\end {array}\right ]\), where \(\hat {L}_{1\backslash \star i}\) denotes all data points belonging to \(\hat {S}_{1}\) except \(\hat {L}_{\star i}\). \(\hat {L}_{-1}\) denotes all data points belonging to \(\{\hat {S}_{j}\}_{j=2}^{\hat {p}}\). \(\hat {C}_{\star i}^{(1\backslash \star i)}\) and \(\hat {C}_{\star i}^{(-1)}\) are sparse coefficients corresponding to \(\hat {L}_{1\backslash \star i}\) and \(\hat {L}_{-1}\), respectively. Now we only need to prove that \(\hat {C}_{\star i}^{(-1)}=0\).
If \(\hat {C}_{\star i}^{(-1)}\neq 0\), then \(\hat {L}_{\star i}\) belongs to a subspace \(\hat {S}_{1}'\) which is different from \(\hat {S}_{1}\), and spanned by data points corresponding to nonzero entries of \(\left [\begin {array}{c} \hat {C}_{\star i}^{(1\backslash \star i)} \\\hat {C}_{\star i}^{(-1)}\end {array}\right ]\). Moreover, the dimension of \(\hat {S}_{1}'\) must be smaller than or equal to d since \(\left \|\left [\begin {array}{c} \hat {C}_{\star i}^{(1\backslash \star i)} \\\hat {C}_{\star i}^{(-1)}\end {array}\right ]\right \|_{0} \le d\). Therefore, \(\hat {L}_{\star i} \in \hat {S}_{1}''=\hat {S}_{1}' \bigcap \hat {S}_{1}\), where \(\bigcap \) denotes the intersection of two subspaces. We first consider the case when the dimension of \(\hat {S}_{1}''\) is smaller than d. Since the data points of \(\hat {L}_{\star }\) are sampled from a continuous distribution of \(\hat {p}\) subspaces, the probability that the data point \(\hat {L}_{\star i}\) lies in a data-point-spanned hyperplane in \(\hat {S}_{1}\) that has dimension smaller than d is 0 (to see this, consider the probability of a data point lying on a fixed line within a plane). Next we show that the number of such hyperplanes is finite. Because the data points are fixed beforehand, there is only a finite number of combinations of data points that can span \(\hat {S}_{1}'\) and further intersect with \(\hat {S}_{1}\) to form \(\hat {S}_{1}''\). Then, the probability of the union of a finite number of combinations is still zero. Therefore, the dimension of \(\hat {S}_{1}''\) equals d, which indicates that the dimensions of \(\hat {S}_{1}'\) and \(\hat {S}_{1}\) are both d. This leads to \(\hat {S}_{1}''=\hat {S}_{1}'=\hat {S}_{1}\). This results in a contradiction, since the data points corresponding to \(\hat {C}_{\star i}^{(-1)}\neq 0\) do not belong to \(\hat {S}_{1}\). Thus, \(\hat {C}_{\star i}^{(-1)}=0\), and the claim holds. □
Appendix 6
DSAPA: proof of Lipschitz differentiability and calculation of Lipschitz constants
A function is Lipschitz differentiable if and only if all its partial gradients are Lipschitz continuous. Lipschitz continuity of a partial gradient is defined in Definition 3.
Definition 3
[60] For any fixed matrices z_{1},z_{2},..,z_{n}, matrix variable y, and a function y→Υ(y,z_{1},z_{2},...,z_{n}), the partial gradient ∇_{y}Υ(y,z_{1},z_{2},...,z_{n}) is said to be Lipschitz continuous with Lipschitz constant L_{p}(z_{1},z_{2},...,z_{n}), if the following holds
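For reference, the standard form of this condition, as used in [60], can be written as follows. The Frobenius norm and the labels y_{1}, y_{2} for the two evaluation points are assumptions made here for concreteness.

\(\left \| \nabla _{y}\Upsilon (y_{1},z_{1},z_{2},\ldots ,z_{n}) - \nabla _{y}\Upsilon (y_{2},z_{1},z_{2},\ldots ,z_{n}) \right \|_{F} \le L_{p}(z_{1},z_{2},\ldots ,z_{n}) \left \| y_{1}-y_{2} \right \|_{F}, \quad \forall y_{1}, y_{2}.\)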
We establish the Lipschitz differentiability of H and compute the corresponding Lipschitz constants of its partial gradients with respect to \(C_{\Phi _{i}},V_{\Phi _{i} \star },L_{\Phi _{i}},E_{\Phi _{i}}\), ∀i∈[W], and U. Let \(L^{t+1}_{p1}\), \(L^{t+1}_{p2}\), \(L^{t+1}_{p3}\), \(L^{t+1}_{p4}\), and \(L^{t+1}_{p5}\) denote the smallest Lipschitz constants of \(\nabla _{C_{\Phi _{i}}} H\), \(\nabla _{V_{\Phi _{i} \star }} H\), \(\nabla _{L_{\Phi _{i}}} H\), \(\nabla _{E_{\Phi _{i}}} H\), and ∇_{U}H in the (t+1)th iteration, respectively. We have
where (a) follows from (30). Equation (69) implies that
where (b) follows from the triangle inequality, and (c) follows from (31). Equation (71) implies that
where (d) comes from the mean value theorem. \(\nabla ^{2} F(\bar {L}_{\Phi _{i}}) \in \mathbb {R}^{m\times q}\) has its (k,j)th entry equal to \(\left. \frac {\partial ^{2} F}{\partial (L_{\Phi _{i}})_{k,j}^{2}}\right |_{(\bar {L}_{\Phi _{i}})_{k,j}}\), and \(\text {diag}(\nabla ^{2} F(\bar {L}_{\Phi _{i}})) \in \mathbb {R}^{mq\times mq}\) is a diagonal matrix whose diagonal vector is \(\text {vec}(\nabla ^{2} F(\bar {L}_{\Phi _{i}}))\). (e) follows from the fact that the l_{2} norm of a diagonal matrix is equal to its entrywise infinity norm. Note that (1) is lower bounded by β, and the probability density function of the normal distribution and its derivative are upper bounded in absolute value by \(\frac {1}{\sqrt {2\pi } \sigma }\) and \(\frac {e^{-1/2}}{\sqrt {2\pi } \sigma ^{2}}\), respectively. Then, one can check that \(\left \|\nabla ^{2} F(\bar {L}_{\Phi _{i}})\right \|_{\infty }\) is bounded by \(\frac {1}{\sigma ^{2} \beta ^{2}}\). (f) is thus obtained by upper bounding \(\left \|\nabla ^{2} F(\bar {L}_{\Phi _{i}})\right \|_{\infty }\). (g) follows from (32). Thus, \(\tau _{L}(E_{\Phi _{i}}^{t}) \le \frac {1}{L^{t+1}_{p3}}\).
where (h) follows from the mean value theorem. (i) is obtained by upper bounding \(\left \|\nabla ^{2} F(\bar {E}_{\Phi _{i}})\right \|_{\infty }\) by \(\frac {1}{\sigma ^{2} \beta ^{2}}\). (j) follows from (33). Equation (74) implies that \(\tau _{E}(L_{\Phi _{i}}^{t+1})=\sigma ^{2} \beta ^{2} \le \frac {1}{L^{t+1}_{p4}}\).
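The two elementary facts used in this step can be checked numerically. The sketch below is only a sanity check under assumed values (the value of `sigma`, the grid, and the example diagonal entries are illustrative and not taken from the paper). It evaluates a zero-mean normal density and its derivative on a fine grid, compares their maxima with the bounds \(\frac {1}{\sqrt {2\pi } \sigma }\) and \(\frac {e^{-1/2}}{\sqrt {2\pi } \sigma ^{2}}\), and confirms that the l_{2} norm of a diagonal matrix equals its largest absolute entry.

```python
import numpy as np

sigma = 0.7  # illustrative noise standard deviation (assumed value)
x = np.linspace(-10 * sigma, 10 * sigma, 200001)

# Zero-mean normal density and its derivative evaluated on the grid.
pdf = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
dpdf = -x / sigma**2 * pdf

# Closed-form bounds: the density peaks at x = 0, |derivative| peaks at x = +/- sigma.
pdf_bound = 1 / (np.sqrt(2 * np.pi) * sigma)
dpdf_bound = np.exp(-0.5) / (np.sqrt(2 * np.pi) * sigma**2)

max_pdf = pdf.max()
max_abs_dpdf = np.abs(dpdf).max()

# Fact used in step (e): the l2 (spectral) norm of a diagonal matrix
# equals the largest absolute value among its diagonal entries.
diag_entries = np.array([0.3, -2.5, 1.1])
spectral = np.linalg.norm(np.diag(diag_entries), 2)
entrywise_max = np.abs(diag_entries).max()

print(max_pdf, pdf_bound)
print(max_abs_dpdf, dpdf_bound)
print(spectral, entrywise_max)
```

On the grid, both maxima agree with the closed-form bounds up to discretization error, so the bounds are attained and cannot be tightened.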
where (k) follows from the inequality ∥·∥_{2}≤∥·∥_{F}. (l) follows from \((V^{t+1})^{T}V^{t+1}=\sum _{i=1}^{W} \iota _{\Phi _{i}}^{t+1}\). Since \(\left \|\lambda _{2} \sum _{i=1}^{W} \iota _{\Phi _{i}}^{t+1}\right \|_{F} \geq L^{t+1}_{p5}\), (m) follows from (34). Equation (75) implies that \(L^{t+1}_{p5} \leq \left \|\lambda _{2} \sum _{i=1}^{W} \iota _{\Phi _{i}}^{t+1}\right \|_{F}\) and \(\tau _{U}(V^{t+1}) \le 1/L^{t+1}_{p5}\).
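The role of the inequality ∥·∥_{2}≤∥·∥_{F} in choosing a step size can be sketched on a simple stand-in problem. The quadratic term below, (λ_{2}/2)∥D−UV^{T}∥_{F}^{2}, is an assumption used only for illustration and is not claimed to be the exact U-dependent part of H; the sizes, the value of `lam2`, and the names `data` and `grad_U` are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 12, 3   # illustrative sizes (assumed)
lam2 = 0.5           # stands in for lambda_2

data = rng.standard_normal((m, n))
V = rng.standard_normal((n, r))


def grad_U(U):
    """Gradient of the stand-in term (lam2 / 2) * ||data - U V^T||_F^2 w.r.t. U."""
    return lam2 * (U @ V.T - data) @ V


U1 = rng.standard_normal((m, r))
U2 = rng.standard_normal((m, r))

gram = V.T @ V
lip_spectral = lam2 * np.linalg.norm(gram, 2)       # smallest Lipschitz constant
lip_frobenius = lam2 * np.linalg.norm(gram, "fro")  # looser bound via ||.||_2 <= ||.||_F

# grad_U(U1) - grad_U(U2) = lam2 * (U1 - U2) @ gram, so the gap is controlled by gram.
grad_gap = np.linalg.norm(grad_U(U1) - grad_U(U2), "fro")
step_gap = np.linalg.norm(U1 - U2, "fro")

# A step size built from the Frobenius bound never exceeds 1 / (Lipschitz constant).
step_size = 1.0 / lip_frobenius

print(grad_gap, lip_spectral * step_gap, lip_frobenius * step_gap)
```

Using the Frobenius norm gives a slightly conservative step size, but it avoids computing the largest singular value of the Gram matrix.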
Based on Definition 3, (69)–(75) guarantee the Lipschitz differentiability of H and provide the Lipschitz constants and the step sizes of the DSAPA.
Appendix 7
Proof of Theorem 3
Proof
The constraints in (22) can be transformed into the following indicator functions.
(76)–(80) correspond to the operations of projection in DSAPA.
Similar to the proof of Theorem 3 in [27], DSAPA globally converges to a critical point of (16) from any initial point, provided that H is Lipschitz differentiable, and
satisfies the Kurdyka-Łojasiewicz (KL) property.
The proof of the Lipschitz differentiability of H is given in Appendix 6. \(B(L_{\Phi _{i}})\), \(J_{1}\left (E_{\Phi _{i}}\right)\), \(J_{2}\left (E_{\Phi _{i}}\right)\), \(K_{1}(C_{\Phi _{i}})\), and \(K_{2}(C_{\Phi _{i}})\) are indicator functions of semialgebraic sets. Therefore, they are KL functions according to [60]. Since H is real analytic, H also has the KL property according to the examples in Section 2.2 of [61]. Thus, (81) satisfies the KL property. □
Availability of data and materials
The Irish smart meter datasets that support the findings of this study are available from the Irish Social Science Data Archive (ISSDA), but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of the ISSDA. The UMass Smart* microgrid dataset analyzed during the current study is available at http://traces.cs.umass.edu/index.php/Smart/Smart.
Notes
 1.
Throughout this paper, we refer to each household as one user.
 2.
The S_{i}’s (i∈[p]) are distinct provided that, for any i≠j, there always exists some β that belongs to S_{i} but not to S_{j}.
 3.
We use the notations \(u(n)\in \mathcal {O}(v(n))\), u(n)∈Ω(v(n)), or u(n)=Θ(v(n)) if, as n goes to infinity, u(n)≤c·v(n), u(n)≥c·v(n), or c_{1}·v(n)≤u(n)≤c_{2}·v(n) eventually holds for some positive constants c, c_{1}, and c_{2}, respectively.
 4.
We assume for simplicity that the corruptions are distributed evenly such that the number of nonzero entries in \(E_{\Phi _{i}}\) is at most \(\frac {s}{W}\). The algorithm can be easily extended to cases in which the numbers of corruptions differ, as long as a reasonably accurate upper bound on the number of corruptions is available.
Abbreviations
UoS: Union of Subspaces
DSAPA: Distributed Sparse Alternative Proximal Algorithm
c.d.f: Cumulative distribution function
SSC: Sparse Subspace Clustering
PMU: Phasor measurement unit
KL: Kullback-Leibler
NI: Normalized mutual information
APGM: Approximate projected gradient method
QRPCA: Quantized Robust Principal Component Analysis
NILM: Nonintrusive load monitoring
References
 1
G. W. Hart, Nonintrusive appliance load monitoring. Proc. IEEE. 80(12), 1870–1891 (1992).
 2
E. J. Aladesanmi, K. A. Folly, Overview of non-intrusive load monitoring and identification techniques. IFAC-PapersOnLine. 48(30), 415–420 (2015).
 3
Z. Erkin, G. Tsudik, in International Conference on Applied Cryptography and Network Security. Private computation of spatial and temporal power consumption with smart meters (Springer, Singapore, 2012), pp. 561–577.
 4
P. Barbosa, A. Brito, H. Almeida, S. Clauß, in Proceedings of the 29th Annual ACM Symposium on Applied Computing, SAC ’14. Lightweight privacy for smart metering data by adding noise (ACM, Gyeongju, 2014), pp. 531–538.
 5
J. M. Bohli, C. Sorge, O. Ugus, in 2010 IEEE International Conference on Communications Workshops. A privacy model for smart metering (IEEE, Cape Town, 2010), pp. 1–5.
 6
M. Backes, S. Meiser, in Data Privacy Management and Autonomous Spontaneous Security. Differentially private smart metering with battery recharging (Springer, Berlin, 2014), pp. 194–212.
 7
D. Varodayan, A. Khisti, in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Smart meter privacy using a rechargeable battery: minimizing the rate of information leakage (IEEE, Prague, 2011), pp. 1932–1935.
 8
D. Egarter, C. Prokop, W. Elmenreich, in 2014 IEEE International Conference on Smart Grid Communications (SmartGridComm). Load hiding of household’s power demand (IEEE, Venice, 2014), pp. 854–859.
 9
S. McLaughlin, P. McDaniel, W. Aiello, in Proceedings of the 18th ACM Conference on Computer and Communications Security. Protecting consumer privacy from electric load monitoring (ACM, Chicago, 2011), pp. 87–98.
 10
X. He, X. Zhang, C. C. J. Kuo, A distortion-based approach to privacy-preserving metering in smart grids. IEEE Access. 1, 67–78 (2013).
 11
M. Savi, C. Rottondi, G. Verticale, Evaluation of the precision-privacy tradeoff of data perturbation for smart metering. IEEE Trans. Smart Grid. 6(5), 2409–2416 (2015).
 12
O. Tan, D. Gunduz, H. V. Poor, Increasing smart meter privacy through energy harvesting and storage devices. IEEE J. Sel. Areas Commun. 31(7), 1331–1341 (2013).
 13
F. L. Quilumba, W. J. Lee, H. Huang, D. Y. Wang, R. L. Szabados, Using smart meter data to improve the accuracy of intraday load forecasting considering customer behavior similarities. IEEE Trans. Smart Grid. 6(2), 911–918 (2015).
 14
A. Albert, R. Ram, Smart meter driven segmentation: what your consumption says about you. IEEE Trans. Power Syst. 28(4), 4019–4030 (2013).
 15
N. Mahmoudi-Kohan, M. P. Moghaddam, M. K. Sheikh-El-Eslami, E. Shayesteh, A three-stage strategy for optimal price offering by a retailer based on clustering techniques. Int. J. Electr. Power Energy Syst. 32(10), 1135–1142 (2010).
 16
C. Dwork, in International Conference on Theory and Applications of Models of Computation. Differential privacy: a survey of results (Springer, Xi’an, 2008), pp. 1–19.
 17
L. Sankar, S. Kar, R. Tandon, H. V. Poor, in Proc. IEEE International Conference on Smart Grid Communications (SmartGridComm). Competitive privacy in the smart grid: an information-theoretic approach (IEEE, Brussels, 2011), pp. 220–225.
 18
C. Y. Ma, D. K. Yau, in Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security. On information-theoretic measures for quantifying privacy protection of time-series data (ACM, Singapore, 2015), pp. 427–438.
 19
S. Li, A. Khisti, A. Mahajan, Information-theoretic privacy for smart metering systems with a rechargeable battery. IEEE Trans. Inf. Theory. 64(5), 3679–3695 (2018).
 20
A. Reinhardt, F. Englert, D. Christin, Averting the privacy risks of smart metering by local data preprocessing. Pervasive Mob. Comput. 16, 171–183 (2015).
 21
E. Elhamifar, R. Vidal, Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35(11), 2765–2781 (2013).
 22
B. Eriksson, L. Balzano, R. Nowak, in Proc. Int. Conf. Artif. Intell. Stat. High-rank matrix completion (JMLR, La Palma, 2012), pp. 373–381.
 23
G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 171–184 (2013).
 24
V. M. Patel, H. Van Nguyen, R. Vidal, Latent space sparse and low-rank subspace clustering. IEEE J. Sel. Topics Signal Process. 9(4), 691–701 (2015).
 25
M. Soltanolkotabi, E. J. Candès, A geometric analysis of subspace clustering with outliers. Ann. Stat. 40(4), 2195–2238 (2012).
 26
M. Soltanolkotabi, E. Elhamifar, E. J. Candès, Robust subspace clustering. Ann. Stat. 42(2), 669–699 (2014).
 27
R. Wang, M. Wang, J. Xiong, Data recovery and subspace clustering from quantized and corrupted measurements. IEEE J. Sel. Topics Signal Process., Spec Issue Robust Subspace Learn. Tracking Theory Algoritm Appl. 12(6), 1547–1560 (2018).
 28
S. A. Bhaskar, Probabilistic low-rank matrix completion from quantized measurements. J. Mach. Learn. Res. 17(60), 1–34 (2016).
 29
Y. Cao, Y. Xie, in Proc. IEEE Int. Workshop Comput. Adv. MultiSensor Adapt. Process. Categorical matrix completion (IEEE, Cancun, 2015).
 30
T. Cai, W. X. Zhou, A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res. 14(1), 3619–3647 (2013).
 31
M. A. Davenport, Y. Plan, E. van den Berg, M. Wootters, 1-bit matrix completion. Inf. Infer. 3(3), 189–223 (2014).
 32
P. Gao, M. Wang, J. H. Chow, M. Berger, L. M. Seversky, Missing data recovery for high-dimensional signals with nonlinear low-dimensional structures. IEEE Trans. Signal Process. 65(20), 5421–5436 (2017).
 33
O. Klopp, J. Lafond, É. Moulines, J. Salmon, Adaptive multinomial matrix completion. Electron. J. Stat. 9(2), 2950–2975 (2015).
 34
J. Lafond, O. Klopp, E. Moulines, J. Salmon, in Adv. Neural Inf. Process. Syst. Probabilistic low-rank matrix completion on finite alphabets (Curran Associates, Montreal, 2014), pp. 1727–1735.
 35
A. S. Lan, C. Studer, R. G. Baraniuk, in Proc. IEEE Int. Conf. Acoust Speech Signal Process. Matrix recovery from quantized and corrupted measurements (IEEE, Florence, 2014), pp. 4973–4977.
 36
A. S. Lan, A. E. Waters, C. Studer, R. G. Baraniuk, Sparse factor analysis for learning and content analytics. J. Mach. Learn. Res. 15(1), 1959–2008 (2014).
 37
P. Gao, R. Wang, M. Wang, J. H. Chow, Low-rank matrix recovery from noisy, quantized and erroneous measurements. IEEE Trans. Signal Process. 66(11), 2918–2932 (2018).
 38
S. A. Bhaskar, in Proc. Asilomar Conf. Signals Syst. Comput. Probabilistic low-rank matrix recovery from quantized measurements: application to image denoising (2015), pp. 541–545.
 39
S. A. Bhaskar, Localization from connectivity: a 1-bit maximum likelihood approach. IEEE/ACM Trans. Netw. 24(5), 2939–2953 (2016).
 40
Y. Yang, J. Feng, N. Jojic, J. Yang, T. S. Huang, in European Conference on Computer Vision. l0-sparse subspace clustering (Springer, Amsterdam, 2016), pp. 731–747.
 41
A. Y. Ng, M. I. Jordan, Y. Weiss, in Adv. Neural Inf. Process. Syst. On spectral clustering: analysis and an algorithm (Morgan Kaufmann Publishers, Vancouver, 2002), pp. 849–856.
 42
J. Lin, E. Keogh, L. Wei, S. Lonardi, Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2), 107–144 (2007).
 43
E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. Locally adaptive dimensionality reduction for indexing large time series databases (ACM, Santa Barbara, 2001), pp. 151–162.
 44
R. Basri, D. W. Jacobs, Lambertian reflectance and linear subspaces. IEEE Trans. Pattern Anal. Mach. Intell. 25(2), 218–233 (2003).
 45
P. Gao, M. Wang, S. G. Ghiocel, J. H. Chow, B. Fardanesh, G. Stefopoulos, Missing data recovery by exploiting low-dimensionality in power system synchrophasor measurements. IEEE Trans. Power Syst. 31(2), 1006–1013 (2016).
 46
M. B. Hossain, I. Natgunanathan, Y. Xiang, L. X. Yang, G. Huang, Enhanced smart meter privacy protection using rechargeable batteries. IEEE Internet Things J. 6(4), 7079–7092 (2019).
 47
A. Reinhardt, D. Egarter, G. Konstantinou, D. Christin, in 2015 IEEE International Conference on Smart Grid Communications (SmartGridComm). Worried about privacy? Let your PV converter cover your electricity consumption fingerprints (IEEE, Miami, 2015), pp. 25–30.
 48
L. Sweeney, k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002).
 49
R. L. Lagendijk, Z. Erkin, M. Barni, Encrypted signal processing for privacy protection: conveying the utility of homomorphic encryption and multiparty computation. IEEE Signal Proc. Mag. 30(1), 82–105 (2012).
 50
T. Baumeister, in 2011 IEEE International Conference on Smart Grid Communications (SmartGridComm). Adapting PKI for the smart grid (IEEE, Brussels, 2011), pp. 249–254.
 51
G. Giaconi, D. Gündüz, H. V. Poor, in 2015 IEEE International Conference on Communications (ICC). Smart meter privacy with an energy harvesting device and instantaneous power constraints (IEEE, Miami, 2015), pp. 7216–7221.
 52
G. Kalogridis, C. Efthymiou, S. Z. Denic, T. A. Lewis, R. Cepeda, in 2010 First IEEE International Conference on Smart Grid Communications. Privacy for smart meters: towards undetectable appliance load signatures (IEEE, Gaithersburg, 2010), pp. 232–237.
 53
J. Gomez-Vilardebo, D. Gündüz, Smart meter privacy for multiple users in the presence of an alternative energy source. IEEE Trans. Inf. Forensic Secur. 10(1), 132–141 (2014).
 54
Y. Hong, W. M. Liu, L. Wang, Privacy preserving smart meter streaming against information leakage of appliance status. IEEE Trans. Inf. Forensic Secur. 12(9), 2227–2241 (2017).
 55
J. A. Snyman, N. Stander, W. J. Roux, A dynamic penalty function method for the solution of structural optimization problems. Appl. Math. Model. 18(8), 453–460 (1994).
 56
Commission for Energy Regulation Smart Metering Project. http://www.ucd.ie/issda/data/commissionforenergyregulationcer. Accessed 5 July 2018.
 57
S. Barker, A. Mishra, D. Irwin, E. Cecchet, P. Shenoy, J. Albrecht, et al, Smart*: an open data set and tools for enabling research in sustainable homes. SustKDD, August. 111(112), 108 (2012).
 58
S. P. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 3(1), 1–122 (2011).
 59
T. M. Cover, J. A. Thomas, Elements of Information Theory (Wiley, Hoboken, 2012).
 60
J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1-2), 459–494 (2014).
 61
Y. Xu, W. Yin, A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imag. Sci. 6(3), 1758–1789 (2013).
Acknowledgements
This research is supported in part by ARO W911NF1710407 and the Rensselaer-IBM AI Research Collaboration (http://airc.rpi.edu), part of the IBM AI Horizons Network (http://ibm.biz/AIHorizons).
Author information
Contributions
Ren and Meng conceived and designed the method and the experiments. Ren performed the experiments and drafted the manuscript. Meng revised the manuscript. Jinjun provided many helpful suggestions. The authors read and approved the final manuscript.
Corresponding author
Correspondence to Meng Wang.
Ethics declarations
Consent for publication
Informed consent was obtained from all authors included in the study.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, R., Wang, M. & Xiong, J. Achieve data privacy and clustering accuracy simultaneously through quantized data recovery. EURASIP J. Adv. Signal Process. 2020, 22 (2020). https://doi.org/10.1186/s13634020006827
Keywords
 Subspace clustering
 Quantization
 Data recovery
 Data privacy
 Smart meter