Sketching for Sequential Change-Point Detection

We study sequential change-point detection using sketches (linear projections) of high-dimensional signal vectors, by presenting sketching procedures derived from the generalized likelihood ratio statistic. We consider both fixed and time-varying projections, and derive theoretical approximations to two fundamental performance metrics: the average run length (ARL) and the expected detection delay (EDD); these approximations are shown to be highly accurate by numerical simulations. We also characterize the performance of the procedure when the projection is a Gaussian random projection or a sparse 0-1 matrix (in particular, an expander graph). Finally, we demonstrate the good performance of the sketching procedures using simulation and real-data examples on solar flare detection and failure detection in power networks.


I. INTRODUCTION
Detecting change-points online from high-dimensional streaming data is a problem arising frequently in applications such as real-time monitoring of sensor networks, computer network anomaly detection, and computer vision (e.g., [2], [3]). To reduce data dimensionality, a common approach is sketching [4], which essentially performs a random projection of the high-dimensional data vectors into lower-dimensional ones.
In this paper, we consider performing change-point detection using sketches of high-dimensional data vectors, and present a new sequential sketching procedure utilizing the generalized likelihood ratio (GLR) statistic. In particular, suppose we may choose an M × N matrix A with M ≪ N, and consider projections of the vectors y_t = A x_t, t = 1, 2, . . .. Assume the pre-change vector is zero-mean Gaussian distributed, and the post-change vector is Gaussian distributed with an unknown mean vector µ and an unchanged covariance matrix. Here we assume the mean vector is unknown since it typically represents an anomaly. The sketching procedure detects such a change by forming a GLR statistic, where the unknown µ is replaced with its maximum likelihood estimator. Because the data are randomly projected into a lower-dimensional space, we are not able to find a unique estimator for the mean. Instead, there is a family of estimators, and we show that picking any one estimator from the family suffices. In this sense, we find that the task of detecting the presence of a signal from its linear projections is much easier than the corresponding tasks of recovering the signal or its support.
We obtain analytic expressions for two fundamental performance metrics, the average run length (ARL) when there is no change and the expected detection delay (EDD) when there is a change-point, by extending the results in [5]. Our approximations are shown to be highly accurate using simulations. These approximations are quite useful in determining the threshold of the detection procedure to control false alarms, without having to resort to onerous numerical simulations.
Our theoretical results also enable us to choose the sketching matrix. We examine the EDD ratio between the sketching procedure and a procedure using the original data, when the sketching matrix A is either a random Gaussian matrix or a sparse 0-1 matrix (in particular, an expander graph). We find that the sketching procedure may achieve a performance very similar to that using the original data when the signal-to-noise ratio is sufficiently large, even when the post-change mean vector is not sparse. This result is consistent with our numerical results. Furthermore, we present an algorithm for time-varying projections and analyze its performance. Finally, we demonstrate the good performance of our procedures on a real data example of solar flare detection and a synthetic example of power failure detection in a network with a real-world power network topology.
Change-point detection problems are closely related to industrial quality control and multivariate statistical process control (SPC) charts, where an observed process is assumed to be initially in control and to go out of control at a change-point. The idea of using random projections for change detection has been explored for SPC in a pioneering work [6] based on the U² multivariate control chart, in the follow-up work [7] for the cumulative sum (CUSUM) control chart and the exponentially weighted moving average (EWMA) schemes, and in [8], [9] based on the Hotelling statistic. These works provide a complementary perspective from SPC design, while our method takes a different approach based on sequential hypothesis testing, treating both the change-point location and the post-change mean vector as unknown parameters. By treating the change-point location as an unknown parameter when deriving the detection statistic, the sequential hypothesis testing approach overcomes the drawback of some SPC methods due to a lack of memory, such as the Shewhart chart and the Hotelling chart, which cannot utilize the information embedded in the entire sequence [10]. Moreover, our sequential GLR statistic may be preferred over the CUSUM procedure in settings where it is difficult to specify the post-change mean vector.
Besides the above distinctions from the SPC methods, our method has several other novel aspects: (1) we develop new theoretical results for the sequential GLR statistic, combining analytical techniques for sequential change-point detection and compressed sensing; (2) we consider sparse 0-1 and time-varying projections (the sparse 0-1 projection corresponds to sampling the observations and is easy to use in practice); the analysis of these two cases is completely new and not taken from the literature; (3) we study the amount of dimensionality reduction that can be performed (i.e., the minimum M) with little performance loss.
This paper extends our preliminary work reported in [1] with several important additions: (1) we introduce time-varying sketching projections, which are more widely applicable, and provide theoretical performance analysis; (2) we include more extensive numerical examples to verify the accuracy of our theoretical results; (3) we include new real-data examples to show the good performance of our sketching procedures for solar flare detection and power failure detection.
Our work is related to compressive signal processing [11], where the problem of interest is to estimate or detect (in the fixed-sample setting) a sparse signal using compressive measurements. In [12], an offline test for a non-zero vector buried in Gaussian noise using noisy linear measurements is studied; interestingly, a conclusion similar to ours is drawn, namely that the task of detection in this setting is much easier than the tasks of estimation and support recovery. Another related work is [13], which considers the problem of identifying a subset of data streams within a larger set, where the data streams in the subset follow a distribution (representing an anomaly) that differs from the original distribution; the problem considered therein is not a sequential change-point detection problem, as the "change-point" happens at the onset (t = 1). In [14], the authors also consider an offline setting with one sequence of samples, and the goal is to identify k out of n samples whose distributions differ from the nominal distribution f_0. They use a "temporal" mixing of the samples over this finite horizon. This differs from our setting, where we project over the signal dimension at each time. In [15], change-point detection using "compressive" measurements of a high-dimensional vector is studied in a Bayesian framework (i.e., Shiryaev's procedure), and a theoretical analysis of the average detection delay is presented. Other related methods include applying kernel methods [16], [17], but they focus on offline change-point detection. Finally, using change-point detection for detecting a transient change in power systems has been studied in [18], where the method is tailored to a physical dynamic model of the real-world power system.
Another possible approach is to obtain compressed representations of the data streams by applying principal component analysis (PCA) [19]. The dimensionality reduction by PCA is achieved by projecting the signal onto a fixed subspace, defined by the eigenspace of the signal associated with the dominant eigenvalues. Our theoretical approximations for the ARL and EDD can also be applied in those settings. However, when the dimension of the data is very large, PCA-based methods are hard to implement [20].
Our notation is standard: χ²_k denotes the chi-square distribution with k degrees of freedom; I_n denotes the identity matrix of size n; X† denotes the pseudoinverse of a matrix X; [x]_i denotes the ith coordinate of a vector x; [X]_ij denotes the (i, j)th element of a matrix X; and x^T denotes the transpose of a vector or matrix x.
The rest of the paper is organized as follows. Section II sets up the formulation of the sketching problem for sequential change-point detection. Section III presents the sketching procedures for solving the problem. Section IV presents the performance analysis for the sketching procedures. Section V demonstrates the accuracy of our performance analysis and the good performance of our sketching procedures on both simulated and real data. Section VI concludes the paper. All proofs are relegated to the appendix.

II. FORMULATION
Consider a sequence of observations x_1, x_2, . . . , x_t, t = 1, 2, . . ., over a possibly infinite time horizon, where x_t ∈ R^N and N is the signal dimension. Initially the observations are due to noise. There can be an unknown time κ at which a change-point occurs and changes the mean of the signal vector. Such a problem can be formulated as the following hypothesis testing problem:

H_0: x_t ∼ N(0, I_N), t = 1, 2, . . .
H_1: x_t ∼ N(0, I_N), t = 1, . . . , κ;  x_t ∼ N(µ, I_N), t = κ + 1, κ + 2, . . . ,    (1)

where the post-change mean vector µ is unknown. Our goal is to detect the change-point as soon as possible after it occurs, subject to false alarm constraints.
Here, we assume the covariance of the data is an identity matrix and that the change occurs only in the mean. This models a large class of problems, such as environmental monitoring and genomic copy number variation detection [21], [22]. To reduce data dimensionality, we may project each signal x_t into a lower-dimensional space using a linear projection, which we refer to as sketching. We aim to develop procedures that can detect a change-point using the resulting sketches. In the following, we consider two approaches to sketching: fixed projection and time-varying projection.
Fixed projection. Choose an M × N projection matrix A with M ≪ N and obtain low-dimensional sketches via

y_t = A x_t,  t = 1, 2, . . . .    (2)

From the earlier model (1), the hypothesis test based on the sketches in (2) can be reformulated as

H_0: y_t ∼ N(0, AA^T), t = 1, 2, . . .
H_1: y_t ∼ N(0, AA^T), t = 1, . . . , κ;  y_t ∼ N(Aµ, AA^T), t = κ + 1, κ + 2, . . . .    (3)
Note that both mean and covariance structures are changed by the projections.
Time-varying projection. In certain applications, we may allow the sketching matrix to vary at each time. This may offer, for instance, a more robust data collection process. Assume at each time, the projection A t ∈ R Mt×N , where the number of sketches M t can be time-varying as well.
Under the time-varying model, the hypothesis test based on the sketches becomes

H_0: y_t ∼ N(0, A_t A_t^T), t = 1, 2, . . .
H_1: y_t ∼ N(0, A_t A_t^T), t = 1, . . . , κ;  y_t ∼ N(A_t µ, A_t A_t^T), t = κ + 1, κ + 2, . . . .    (4)

Note that we may allow the number of sketches to differ at each time as well. The above models also capture several important cases. (i) (Pairwise comparison.) In certain problems, such as social network studies and computer vision, we are interested in pairwise comparisons of variables [23], [24]. This entails a total of N(N−1)/2 possible comparisons, and we may select M of them to reduce complexity. Such pairwise comparisons can be modeled as structured fixed projections A with only {0, 1, −1} entries; for instance, a row with a single 1 in position i and a single −1 in position j measures the difference [x_t]_i − [x_t]_j. (ii) (Missing data.) In many scenarios, at each time we are only able to observe a subset of the entries of the signal, and the locations of the observable entries can vary with time [25]. Suppose at each time we observe a subset of entries Ω_t ⊆ {1, . . . , N}. If |Ω_t| = M_t, then A_t ∈ R^{M_t×N} is a submatrix of an identity matrix with rows selected according to Ω_t. (iii) (PCA.) There are also methods using principal component analysis (PCA) of data streams for change-point detection (e.g., [19], [26]). The main idea is to extract several principal components of the data streams in a sliding time window, and to perform change detection (e.g., using an SPC control chart or a CUSUM procedure) on these principal components. Note that this method can be viewed as using a fixed projection A, with A being the eigenspace of the signal associated with the dominant eigenvalues.
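Case (i) above (pairwise comparison) corresponds to a structured {0, 1, −1} projection. The following is a minimal construction sketch; the function name and sizes are illustrative, not from the original text:

```python
import numpy as np
from itertools import combinations

def pairwise_projection(N, pairs):
    """Fixed projection with {0, 1, -1} entries: the row for pair (i, j)
    measures the difference [x]_i - [x]_j."""
    A = np.zeros((len(pairs), N))
    for r, (i, j) in enumerate(pairs):
        A[r, i], A[r, j] = 1.0, -1.0
    return A

all_pairs = list(combinations(range(4), 2))   # all N(N-1)/2 = 6 comparisons for N = 4
A = pairwise_projection(4, all_pairs[:3])     # keep only M = 3 of them
y = A @ np.array([1.0, 2.0, 3.0, 4.0])        # the 3 selected pairwise differences
```

Each sketch y_t = A x_t then collects the M selected pairwise differences at time t.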

A. Fixed projection
We may derive a likelihood-ratio-based detection procedure for the hypothesis test (3) using sketches. The statistic involves an average of samples within a window [k, t], defined as

ȳ_{k,t} = (1/(t−k)) Σ_{i=k+1}^t y_i.    (5)

Since the observations are i.i.d. over time, for an assumed value of the change-point κ = k, the log-likelihood ratio of observations accumulated up to time t > k under our hypothesis testing model (3) is given by

ℓ(t, k, µ) = (t−k) [ µ^T A^T (AA^T)^{−1} ȳ_{k,t} − (1/2) µ^T A^T (AA^T)^{−1} A µ ],    (6)

where (6) is the log-likelihood ratio (log-LR) statistic for detection.
Since µ is unknown, we replace it with a maximum likelihood estimator for fixed k and t in the likelihood ratio (6), to obtain the log generalized likelihood ratio (log-GLR) statistic (one example of log-LR and log-GLR statistics is given in [27]). Taking the derivative of ℓ(t, k, µ) in (6) with respect to µ and setting it to zero, we obtain an equation for the maximum likelihood estimate µ* of the post-change mean vector. As a function of the current number of observations t and the putative change-point location k, µ* needs to satisfy

A^T (AA^T)^{−1} A µ* = A^T (AA^T)^{−1} ȳ_{k,t}.    (7)

Note that we may also write this as A µ* = ȳ_{k,t}. Hence, a maximum likelihood estimator for the post-change mean vector is any µ* that satisfies

µ* = A^T (AA^T)^{−1} ȳ_{k,t} + c,

for a vector c ∈ R^N that lies in the null space of A: Ac = 0.
In particular, we may choose the zero vector c = 0 and use the estimator

µ* = A^T (AA^T)^{−1} ȳ_{k,t}.    (8)

Substituting such a µ* into (6), we form the log generalized likelihood ratio (GLR) statistic. Using (8), the first and second terms in (6) become, respectively, (t−k) ȳ_{k,t}^T (AA^T)^{−1} ȳ_{k,t} and ((t−k)/2) ȳ_{k,t}^T (AA^T)^{−1} ȳ_{k,t}. Finally, (6) gives the log-GLR statistic:

ℓ(t, k, µ*) = ((t−k)/2) ȳ_{k,t}^T (AA^T)^{−1} ȳ_{k,t}.    (9)

Finally, we define a sketching procedure that stops and raises an alarm whenever the log-GLR statistic rises above a threshold b > 0:

T = inf{ t : max_{t−w ≤ k < t} ((t−k)/2) ȳ_{k,t}^T (AA^T)^{−1} ȳ_{k,t} ≥ b }.    (10)

Here w > 0 is a window size, i.e., we only consider the past w samples from the current time t. The role of the window size is two-fold. On the one hand, it reduces the memory requirement of the detection procedure; on the other hand, it effectively establishes a minimum level of change that we want to detect. We may further simplify the log-GLR statistic in (9) by using the singular value decomposition (SVD) of A and derive an equivalent procedure; this facilitates the performance analysis in Section IV. Let the SVD of A be given by

A = U Σ V^T,    (11)

where U ∈ R^{M×M}, Σ ∈ R^{M×M} is a diagonal matrix containing all non-zero singular values, and V ∈ R^{N×M}. Using the SVD of A, (AA^T)^{−1} = U Σ^{−2} U^T, and we can write the log-GLR statistic (9) as

ℓ(t, k, µ*) = ((t−k)/2) ‖Σ^{−1} U^T ȳ_{k,t}‖².    (12)

Substituting (5) into (12) and defining the whitened sketches z_i = Σ^{−1} U^T y_i results in a detection statistic based on z̄_{k,t} = (1/(t−k)) Σ_{i=k+1}^t z_i. Under the null hypothesis, z_i ∼ N(0, I_M); similarly, under the alternative hypothesis, y_i ∼ N(Aµ, AA^T), and hence z_i ∼ N(V^T µ, I_M). Then we have the following equivalent form for the sketching procedure T in (10):

T = inf{ t : max_{t−w ≤ k < t} ((t−k)/2) ‖z̄_{k,t}‖² ≥ b }.    (13)

The form of T offers an intuitive explanation: the sketching procedure essentially projects the data into M (fewer than N) independent data streams and detects the change by forming a log-GLR statistic from these independent data streams.
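The equivalent form of the procedure is straightforward to implement. Below is a minimal sketch of the stopping rule; the sizes, threshold, and change magnitude in the toy run are illustrative choices, not values from the original text:

```python
import numpy as np

def sketch_glr_stop(x_stream, A, b, w):
    """Sketching procedure T: stop at the first t such that
    max_{t-w <= k < t} (t-k)/2 * ||z_bar_{k,t}||^2 >= b,
    using the whitened sketches z_i = Sigma^{-1} U^T A x_i."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    z_hist = []
    for t, x in enumerate(x_stream, start=1):
        z_hist.append((U.T @ (A @ x)) / s)       # z_i ~ N(0, I_M) pre-change
        z_hist = z_hist[-w:]                     # keep only the last w samples
        csum = np.cumsum(z_hist[::-1], axis=0)   # sums over windows of length 1..w
        lens = np.arange(1, len(z_hist) + 1)
        stats = np.sum(csum ** 2, axis=1) / (2 * lens)
        if stats.max() >= b:
            return t
    return None                                  # no alarm raised

# Toy run: mean of all N coordinates shifts by 0.5 at t = 100
rng = np.random.default_rng(1)
M, N, w, b = 20, 100, 50, 35.0
A = rng.standard_normal((M, N)) / np.sqrt(M)
pre = rng.standard_normal((100, N))
post = rng.standard_normal((200, N)) + 0.5
t_stop = sketch_glr_stop(np.vstack([pre, post]), A, b, w)
```

The alarm time `t_stop` should fall shortly after the change at t = 100 when the signal-to-noise ratio is large enough.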

B. Time-varying projection
Similarly, for an assumed value of the change-point κ = k, the log-LR statistic for observations accumulated up to time t is given by

ℓ(t, k, µ) = Σ_{i=k+1}^t [ µ^T A_i^T (A_i A_i^T)^{−1} y_i − (1/2) µ^T A_i^T (A_i A_i^T)^{−1} A_i µ ].    (14)

Again, to obtain the log-GLR statistic, we replace the unknown µ with a maximum likelihood estimator using data between the hypothetical change location k and the current time t. Taking the derivative of ℓ(t, k, µ) in (14) with respect to µ and setting it to zero, we obtain an equation for the maximum likelihood estimate µ*. For fixed k and t, µ* needs to satisfy

Σ_{i=k+1}^t A_i^T (A_i A_i^T)^{−1} A_i µ* = Σ_{i=k+1}^t A_i^T (A_i A_i^T)^{−1} y_i.    (15)

To solve for µ*, one needs to discuss the rank of the matrix Q formed by stacking the right singular vectors V_i of the matrices A_i, i = k+1, . . . , t.

Consider two cases:
(i) The A_i are independent Gaussian random matrices. Note that the columns within each V_i block are linearly independent since they are orthonormal. Moreover, columns in different V_i blocks are linearly independent with probability one, since the A_i are independent and their entries are drawn as independent continuous random variables. Therefore, the columns of Q are linearly independent with probability one, and rank(QQ^T) = min(N, (t−k)M) with probability one. We can thus claim that if t−k < N/M, QQ^T is not full rank, and if t−k ≥ N/M, QQ^T is full rank with probability one. (ii) The A_i are independent random matrices with entries drawn from a discrete distribution. In this case, we can still claim that if t−k < N/M, QQ^T is not full rank, and if t−k ≥ N/M, QQ^T is full rank with high probability, though not with probability one.
To avoid the invertibility issue, we use the pseudoinverse instead. Define

B_{k,t} = Σ_{i=k+1}^t A_i^T (A_i A_i^T)^{−1} A_i;

then from (15), µ* is given by

µ* = B_{k,t}^† s_{k,t},  where s_{k,t} = Σ_{i=k+1}^t A_i^T (A_i A_i^T)^{−1} y_i.

Substituting such a µ* into (14), we obtain the log-GLR statistic:

ℓ(t, k, µ*) = (1/2) s_{k,t}^T B_{k,t}^† s_{k,t}.    (16)

We cannot further simplify the expression of the GLR in (16) without specifying the form of A_t, t = 1, 2, . . .. In such cases, computing the pseudoinverse may incur a high cost, on the order of O(N³).
In the special case when A_t has 0-1 entries only, which corresponds to the missing-data and expander-graph cases, we can obtain a simpler expression for the statistic in (16) that can be computed efficiently. In this case, A_t A_t^T is an M_t × M_t identity matrix and A_t^T A_t is a diagonal matrix, which simplifies our computation: for a diagonal matrix D ∈ R^{N×N} with diagonal entries λ_1, . . . , λ_N, the pseudoinverse of D is also a diagonal matrix, with nth diagonal entry λ_n^{−1} if λ_n ≠ 0 and 0 otherwise. Therefore, the statistic in (16) can be simplified to

ℓ(t, k) = (1/2) Σ_{n: V_n(k,t) > 0} ( Σ_{i=k+1}^t [P_{A_i}(x_i)]_n )² / V_n(k,t),    (17)

where P_{A_i}(x_i) denotes an N-dimensional vector with the unobserved entries at time i filled by 0, and V_n(k,t) denotes the number of times that the nth entry is observed during the time interval [k+1, t]. Hence, for this special case, the sketching procedure with time-varying projection becomes

T_2 = inf{ t : max_{t−w ≤ k < t} ℓ(t, k) ≥ b },    (18)

where b > 0 is the prescribed threshold. Note that the statistic essentially "averages" the observed vector within the time window [t−w, t), performing a kind of "missing-data imputation".
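The simplified statistic for the missing-data case can be sketched as follows; the sizes, sampling rate, and threshold in the toy run are illustrative assumptions, not values from the original text:

```python
import numpy as np

def missing_data_glr(x_obs, mask, w, b):
    """Sketching procedure T_2 for time-varying 0-1 projections ("missing data").
    x_obs[i] is P_{A_i}(x_i): observed entries of x_i, zeros elsewhere; mask[i] is 0/1.
    Statistic for putative change-point k:
    (1/2) * sum over entries with V_n(k,t) > 0 of S_n^2 / V_n(k,t),
    where S_n sums the observed values and V_n(k,t) counts the observations."""
    T, N = x_obs.shape
    for t in range(1, T + 1):
        best = 0.0
        for k in range(max(0, t - w), t):
            S = x_obs[k:t].sum(axis=0)     # sums of observed values over (k, t]
            V = mask[k:t].sum(axis=0)      # times each entry was observed
            seen = V > 0
            best = max(best, 0.5 * np.sum(S[seen] ** 2 / V[seen]))
        if best >= b:
            return t
    return None

# Toy run: observe M = 5 of N = 20 entries per time; mean shifts by 1 at t = 80
rng = np.random.default_rng(2)
T0, T1, N, M, w, b = 80, 150, 20, 5, 30, 40.0
mask = np.zeros((T0 + T1, N))
for i in range(T0 + T1):
    mask[i, rng.choice(N, size=M, replace=False)] = 1.0
x = rng.standard_normal((T0 + T1, N))
x[T0:] += 1.0
t_stop = missing_data_glr(x * mask, mask, w, b)
```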

A. Approximations to ARL and EDD, fixed projection
First, we introduce some notation for the theoretical analysis. Under the null hypothesis H_0 in (1), the observations have zero mean; probability and expectation in this case are denoted by P_∞ and E_∞, respectively. Under the alternative hypothesis H_1, there exists a change-point κ, 0 ≤ κ < ∞, such that the observations have mean µ for all t > κ; probability and expectation in this case are denoted by P_κ and E_κ, respectively.
The choice of the threshold b involves a tradeoff between two standard performance metrics commonly used for analyzing change-point detection procedures [5]: (i) the expected value of the stopping time when there is no change, the average run length (ARL); and (ii) the expected detection delay (EDD), defined as the expected stopping time in the extreme case where a change occurs immediately at κ = 0, denoted E_0{T}. The following argument from [28] explains why we consider E_0{T}. When there is a change at κ, we are interested in the expected delay until its detection, i.e., the conditional expectation E_κ{T − κ | T > κ}, which is a function of κ. When the shift in the mean occurs only in the positive direction, E_0{T} provides an (approximate) upper bound on E_κ{T − κ | T > κ} over κ. It is not obvious that this remains true when [µ]_i can be either positive or negative; however, since E_0{T} is certainly of interest and reasonably easy to analyze, it is common to consider E_0{T}.
By extending the results in [5] based on the observations in (13), we can obtain approximations to the ARL and the EDD as follows.
Theorem 1 (ARL). Assume that 1 ≤ M ≤ N, and that b → ∞ with M → ∞ and b/M fixed. Then, for w = o(b^r) for some positive integer r, the ARL of the sketching procedure defined in (10) is given by the approximation in (19), where the constant c(M, b, w) defined in (20) involves the special function (cf. [29], page 82)

ν(x) = (2/x²) exp{ −2 Σ_{n=1}^∞ (1/n) Φ(−|x|√n / 2) },

with Φ the standard normal cumulative distribution function. For numerical purposes, a simple and accurate approximation is given by (cf. [30])

ν(x) ≈ [(2/x)(Φ(x/2) − 1/2)] / [(x/2) Φ(x/2) + φ(x/2)],

where φ is the standard normal density.

Theorem 2 (EDD). Suppose b → ∞ with the other parameters held fixed. Then, for a given matrix A with right singular vectors V, the EDD of the sketching procedure (10) when κ = 0 is, to first order,

E_0{T} = (1 + o(1)) b / (∆²/2),  where ∆ = ‖V^T µ‖.

Here S̃_t = Σ_{i=1}^t δ_i is a random walk whose increments δ_i are independent and identically distributed with mean E{δ_i} = ∆²/2.

The proofs of the above two theorems utilize the fact explained in (13): when there is no change-point, the sketching procedure is equivalent to monitoring M data streams consisting of white noise with unit variance, and when there is a change-point, the post-change distribution of the observation vector has mean shifted to V^T µ with the same covariance. This analysis demonstrates that the sketching procedure corresponds to a special case of the so-called "mixture procedure" (cf. T_2 in [5]) with p_0 = 1, M sensors, and post-change mean vector V^T µ. Remark 1. The first-order approximation to the EDD is b/(∆²/2), which is the threshold b divided by the Kullback-Leibler (K-L) divergence (see, e.g., [5] for the fact that ∆²/2 is the K-L divergence between N(0, I_M) and N(V^T µ, I_M)). This makes sense intuitively, since the expected increment of the detection statistic (viewed as a random process indexed by time) is roughly the K-L divergence of the test. Remark 2. We may establish the asymptotic optimality of T in (13) using the following argument, based on our earlier work [31], which proves asymptotic optimality in the more complicated non-i.i.d. scenarios. Define C(γ) = {T : E_∞{T} ≥ γ} as the set of stopping times whose ARL has a prescribed lower bound γ.
We may then show that T is asymptotically optimal within C(γ) provided b = O(log γ) and log w = o(log γ).
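For numerical work, the simple closed-form approximation to ν(·) referenced above (we assume it is the standard formula cited from [30]) is easy to code:

```python
import math

def nu_approx(x):
    """Closed-form numerical approximation to the special function nu(x):
    nu(x) ~ (2/x) * (Phi(x/2) - 1/2) / ((x/2) * Phi(x/2) + phi(x/2)),
    where Phi and phi are the standard normal CDF and density."""
    Phi = 0.5 * (1.0 + math.erf(x / (2.0 * math.sqrt(2.0))))          # Phi(x/2)
    phi = math.exp(-0.5 * (x / 2.0) ** 2) / math.sqrt(2.0 * math.pi)  # phi(x/2)
    return (2.0 / x) * (Phi - 0.5) / ((x / 2.0) * Phi + phi)
```

Note that nu_approx(x) → 1 as x → 0 and decays for large x, consistent with the exact series definition.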
Remark 3. For a fixed large ARL, when M increases, the threshold b grows approximately linearly with M. Fig. 1 shows such an example, where N = 100 and w = 200; we fix the ARL to be 5000 and find the corresponding threshold b using Theorem 1 as M increases from 10 to 100. (In Section V, we verify that Theorem 1 is an accurate approximation.) Note that the resulting b-versus-M curve is very close to a straight line (shown as the dashed line). This can also be interpreted using Theorem 1.
As b → ∞ with b/M fixed, several factors in the ARL approximation of Theorem 1 tend to constants, and the remaining terms grow exponentially in b. A common property of sequential change-point detection procedures is that the ARL grows exponentially with the threshold b [2], which is why a large ARL can be achieved with a reasonably valued threshold b. To achieve this here, due to (23), we need b to grow at least linearly with M.
Remark 4. To compare the performance of the sketching procedure with that of a procedure using the original data, we consider the ratio between the EDD using the original data (denoted EDD(N)) and the EDD using the sketches (denoted EDD(M)). Let

Γ = ‖V^T µ‖² / ‖µ‖².

From Theorem 2, the EDD of the sketching procedure is, to first order, b/(Γ‖µ‖²/2). Remark 3 states that to achieve a constant ARL, b should grow linearly with M. Hence, the ratio EDD(N)/EDD(M) is proportional to (N/M)·Γ. In Section IV-B, we will show that when A is a Gaussian random matrix, Γ is on the order of M/N. Hence, for a Gaussian random matrix, this ratio is on the order of (N/M)·(M/N) = 1. This result is intriguing, since it implies that there is little loss in performance from Gaussian random projection, as we indeed observe in numerical examples.
Moreover, we show in Section IV-B2 that when A is a sparse 0-1 matrix with d non-zero entries in each row (in particular, an expander graph), Γ is on the order of M(1−ε)/(dN) for some small number ε > 0. Hence, the ratio is on the order of (N/M)·[M(1−ε)/(dN)] = (1−ε)/d. This implies that the sparse matrix may still be used for sketching purposes, but it may perform worse than the Gaussian random matrix by a factor of roughly d. Such a result is expected, as the sparse A maps less information into the sketches in exchange for reducing the complexity of the measurements, since only a few entries of the signal are required for each measurement. These theoretical predictions are also validated by the numerical studies in Section V. There is a possible intuitive explanation for the above. Unlike in compressed sensing, where the goal is to recover a sparse signal and hence one needs the projection to preserve norms up to a factor through the restricted isometry property (RIP) [32], here our goal is to detect a non-zero vector in Gaussian noise. Hence, even though the projection reduces the norm of the vector, as long as it does not diminish the signal norm to below the noise floor, we are still able to detect the signal. Since the Gaussian matrix is dense with probability one, it achieves this goal; the expander graph, being sparse, may have some probability of missing the signal through projection.

B. Choice of A and bounding Γ
In the following, we establish bounds for Γ when A is a Gaussian matrix and an expander graph, respectively.
1) Gaussian matrix: Consider A ∈ R^{M×N} whose entries are i.i.d. Gaussian with mean zero and variance 1/M. We show that, provided M and N grow proportionally, Γ concentrates around M/N at a rate exponential in N.
Lemma 1 ([33]). Let A ∈ R^{M×N} have i.i.d. N(0,1) entries, and let its SVD be given by A = UΣV^T. Then for any fixed vector µ,

‖V^T µ‖² / ‖µ‖² ∼ Beta(M/2, (N−M)/2).

More related results can be found in [34]. Since the Beta(α, β) distribution has mean α/(α+β), we have E{Γ} = M/N. Next, we may also show that, provided M and N grow proportionally, Γ converges to its mean value at a rate exponential in N. Define δ ∈ (0,1) to be

δ = lim_{N→∞} M/N.    (27)

Theorem 3 (Gaussian A). Let A ∈ R^{M×N} have entries i.i.d. N(0,1). Let N → ∞ such that (27) holds. Then, for 0 < ε < min(δ, 1−δ), P(|Γ − δ| > ε) → 0 at a rate exponential in N.

(Fig. 2: a bipartite graph with N variable nodes and M parity check nodes.)

2) Sparse sketching matrix A: Interestingly, it can be shown that we may also use a sparse 0-1 matrix A for sketching (even if µ is not sparse), up to a small performance loss (i.e., a longer EDD for a fixed ARL), provided that the post-change mean vector is element-wise positive (corresponding to the so-called "one-sided" change scenario typically considered in environmental monitoring [5]). These sparse sketching matrices A enable efficient sketching schemes, as each entry of the sketched vector only requires collecting information from a few dimensions of the original data vector.
Assume [µ]_i ≥ 0 for all i. Consider a matrix A ∈ R^{M×N} with binary entries, which corresponds to a bipartite graph. Following coding-theory terminology, we call the left nodes variable nodes (there are N of them), which correspond to the entries of x_t, and we call the right nodes parity check nodes (there are M of them), which correspond to the entries of y_t. In a bipartite graph, connections between the variable nodes are not allowed. The adjacency matrix of the bipartite graph corresponds to our A here. We further consider a bipartite graph with regular left degree c (i.e., the number of edges from each variable node is c) and regular right degree d (i.e., the number of edges from each parity check node is d), as illustrated in Fig. 2. Hence, this requires Nc = Md.
Expander graphs satisfy the above requirements, and they have been used in compressed sensing to sense sparse vectors (e.g., [35]). In particular, a matrix A corresponds to an (s, ε)-expander graph with regular left degree c if and only if each column of A has exactly c ones, and for any set S of variable nodes with |S| ≤ s, the set N(S) of neighboring parity check nodes has size |N(S)| ≥ (1−ε)c|S|. If it further holds that each row of A has d ones, we say A corresponds to an (s, ε)-expander with regular left degree c and regular right degree d. The existence of such expander graphs is established in [36].

C. Approximation for ARL and EDD of time-varying projection
It is intractable to directly analyze the general procedure based on (16). However, we may characterize the performance of T_2 in (18) for the "missing data" case, where A_i consists of 0-1 entries and M_t = M for all t > 0, using the following scheme. Assume that at each time we randomly sample M out of the N entries to observe. Fig. 3 illustrates the idea with a realization of the sampling scheme in procedure T_2, with N = 10 rows, t − k = 17 columns, and M = 3 subsamples taken at each time. Dots denote the observed entries. Since we randomly sample 3 observations at each time, the number of dots in each column of Fig. 3 is 3.
The idea for the approximation is to relax the constraint that at each time we observe exactly M out of N entries, to sampling each entry of x_i independently with probability r = M/N. This eases finding an approximation to the EDD of T_2. Define ξ_ni, 1 ≤ n ≤ N, i ≥ 1, as i.i.d. Bernoulli random variables with parameter r. Based on this, we introduce an approximate procedure T̃_2 of the same form as (18), where b > 0 is the prescribed threshold. Next, we further approximate T̃_2 so that we can apply the previous ARL analysis of (10) in Theorem 1: taking the expectation with respect to the Bernoulli random variables, we replace Σ_{i=k+1}^t [x_i]_n ξ_ni by its expectation r Σ_{i=k+1}^t [x_i]_n, and Σ_{i=k+1}^t ξ_ni by its expectation (t−k)r.

A further approximation of T̃_2 is then given by a procedure of the same form as (10), with prescribed threshold b > 0. Following an argument similar to that of Theorem 1, replacing M with N and b with b′ = b/r, we obtain the ARL approximation (29), where c(N, b′, w) is defined by replacing b with b′ in (20). Table IV in Section V-C verifies that the theoretical threshold obtained from this approximation is very close to that obtained from simulation. We can also obtain a first-order approximation of the EDD for the procedure in (18) by using the approximation T̃_2. If the window size w is sufficiently large, then, using Wald's identity [29] and ignoring the overshoot of the statistic over the threshold b, we obtain a first-order approximation by equating the expectation of the right-hand side of (31) to b; solving the resulting equation yields the first-order approximation (32) for the expected detection delay. Numerical examples in Section V-C demonstrate that this is a reasonably good approximation.

V. NUMERICAL EXAMPLES
In the subsequent examples, we select the ARL to be 5000 to represent a low false-detection rate (a similar choice is made in other sequential change-point detection work, such as [5]). In practice, however, the target ARL value depends on the sampling rate and on how frequently we can tolerate a false detection (e.g., once a month or once a year). All simulated results are obtained from 10⁴ repetitions.

A. Gaussian random matrix
Consider A generated as a Gaussian random matrix, with entries i.i.d. N (0, 1/N ).
1) Accuracy of theoretical approximations: We use Theorem 1 to find the threshold b corresponding to an ARL equal to 5000. Since the ARL is an increasing function of the threshold b, we use bisection to find the threshold b that corresponds to the targeted ARL of 5000. We then compare it with the threshold obtained from simulation. As shown in Table I, the threshold found using Theorem 1 is very close to that obtained from simulation. Therefore, even though the theoretical ARL approximation is derived in an asymptotic regime where N tends to infinity, it is applicable when N is large but finite. This is quite useful for determining a threshold for a targeted ARL, as simulations for large N and M can be quite time-consuming when the ARL is large (e.g., 5000 or 10,000).
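The bisection step is simple to code. Below is a minimal sketch; the exponential ARL curve is a hypothetical stand-in for the Theorem 1 approximation (the constant 0.9 is an arbitrary illustration choice, not from the paper):

```python
import math

def find_threshold(arl, target, lo, hi, tol=1e-6):
    """Bisection for the threshold b solving arl(b) = target.
    Relies only on the ARL being an increasing function of b."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if arl(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical stand-in ARL curve: ARL grows exponentially in b
arl = lambda b: math.exp(0.9 * b)
b_star = find_threshold(arl, 5000.0, 0.0, 30.0)
```

In practice, `arl` would be the Theorem 1 approximation evaluated at the given M and w.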
Moreover, we also simulate the EDD for detecting a signal with a post-change mean vector µ with entries all equal to [µ]_i = 0.5. As also shown in Table I, the approximations for the EDD using Theorem 2 are also reasonably accurate.
2) Comparison of EDD: Next, we compare the EDD of the procedure using the original data (which corresponds to A = I) with that of the sketching procedures using a Gaussian A, with N = 500 and various M < N.
First, consider the post-change mean vector with [µ]_i = µ_0. Fig. 4(a) shows the EDD versus an increasing signal strength µ_0. Clearly, when µ_0 is sufficiently large, the sketching procedure approaches the performance obtained using the original data, as predicted by the theory in Section IV-B. From Fig. 4(a), we can also obtain the minimum M, denoted M*, required so that the EDD of the sketching procedure is within one sample of the EDD using the original data. The results are shown in Table II. When µ_0 is sufficiently large, we may use M less than 30 for an N = 500 dimensional signal. Note that we do not require the signals to be sparse.

Then, assume that the post-change mean vector has only 100p% of its entries [µ]_i equal to one and the other entries equal to zero. Fig. 4(b) shows the EDD versus an increasing p. As p increases, the signal strength also increases, and the sketching procedure approaches the performance obtained using the original data. Similarly, we can obtain the minimum M required so that the EDD of the sketching procedure is within one sample of the EDD using the original data. For example, when p = 0.5, one can use M = 100 for an N = 500 dimensional signal to detect the change in the mean without much performance loss.
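A minimal Monte Carlo sketch of this kind of EDD experiment is below. It uses a simplified windowed statistic in place of the paper's exact GLR, treating the sketch covariance as approximately identity (which holds for rows of A with i.i.d. N(0, 1/N) entries); all names and parameter values are ours:

```python
import math
import random

random.seed(7)

N, M, w = 50, 10, 20   # signal dimension, number of sketches, window length
mu0 = 1.0              # post-change mean entry; the change happens at t = 0 here

# Fixed Gaussian projection with entries i.i.d. N(0, 1/N), as in the text.
A = [[random.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(M)]

def sketch(x):
    # y = A x
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def windowed_stat(ys):
    # Simplified GLR-type statistic: max over candidate change-points of
    # ||sum of the last `back` sketches||^2 / back, treating the sketch
    # covariance as identity (a simplification of the paper's statistic).
    best, s = 0.0, [0.0] * M
    for back in range(1, min(w, len(ys)) + 1):
        s = [si + yi for si, yi in zip(s, ys[-back])]
        best = max(best, sum(si * si for si in s) / back)
    return best

def detection_delay(b):
    # Feed post-change data until the statistic crosses the threshold b.
    ys = []
    for t in range(1, 200):
        x = [mu0 + random.gauss(0.0, 1.0) for _ in range(N)]
        ys.append(sketch(x))
        if windowed_stat(ys) > b:
            return t
    return None

delay = detection_delay(b=30.0)
```

Averaging such delays over many runs (and over random draws of A) produces the EDD curves of Fig. 4; sweeping M then yields the minimum M* discussed above.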

B. Expander graph
Now consider A generated as an expander graph. We compare the EDD of the change-point detection procedure using the original data (which corresponds to A = I) with that of the sketching procedures using an expander-graph A, with N = 500 and various M < N. To demonstrate the performance when A is an expander graph, we run simulations with the same settings as those in Section V-A. We verify that the theoretical approximations are accurate for expander graphs as well (details omitted).
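A simple way to generate such a sparse 0-1 matrix is to give every column exactly d ones at random rows; random regular bipartite graphs of this kind are expanders with high probability, though the sketch below (with parameters of our choosing) does not verify the expansion property:

```python
import random

random.seed(0)

def sparse_01_matrix(m, n, d):
    # Random 0-1 matrix with exactly d ones per column: each signal
    # coordinate (column) is connected to d distinct sketches (rows).
    A = [[0] * n for _ in range(m)]
    for j in range(n):
        for i in random.sample(range(m), d):
            A[i][j] = 1
    return A

A = sparse_01_matrix(m=50, n=500, d=4)
```

Each sketch is then a sum of a few signal coordinates, which is why this choice aggregates positive signal entries so effectively.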
1) Comparison of EDD: Again, consider the post-change mean vector with [µ]_i = µ_0. Fig. 5(a) shows the EDD for an increasing µ_0. Note that the simulated EDDs are much smaller than those for the Gaussian random projections in Fig. 4. A possible reason is that the expander graph is better at aggregating the signals when the [µ]_i are all positive. However, when the [µ]_i can be either positive or negative, the two choices of A have similar performance, as shown in Fig. 6.

Then, assume that the post-change mean vector has only 100p% of its entries [µ]_i equal to one and the other entries equal to zero. Fig. 5(b) shows the simulated EDD versus an increasing p. As p tends to 1, the sketching procedure approaches the performance obtained using the original data. From Fig. 5(b) we can obtain the minimum M required so that the EDD of the sketching procedure is within one sample of the EDD using the original data, as shown in Table III. For example, when p is around 0.5, we may use M = 50 for an N = 500 dimensional signal. When p is larger than 0.7, one may use M less than 30 for an N = 500 dimensional signal to detect the change, which implies that we can use a fairly small number of sketches to achieve quick detection.

C. Time-varying projection by 0-1 matrices

1) Accuracy of theoretical approximations: For time-varying projection, Table IV shows the accuracy of the approximation of b obtained from the ARL in (29) and the accuracy of the EDD approximation in (32) for various M, with N = 100, w = 200, and all [µ]_i = 0.5. Note that the approximation is more accurate when M/N is close to 1. One intuitive explanation is that, when M/N is close to 0, replacing the number of observations for each entry by its mean (t − k)M/N is less accurate, and hence the resulting approximation of T 2 is less accurate.
To understand whether sparsity of the post-change mean vector affects the accuracy of the EDD approximation in (32), we perform another experiment with N = 100, w = 200, and 25% of the entries [µ]_i equal to 1. Note that the total signal energy Σ_{n=1}^{N} µ_n² is the same as above, so the theoretical EDD computed from (32) should be the same. The simulated EDDs are shown in Table V, which shows that the approximation in (32) is still accurate.
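The energy claim is easy to check numerically; a quick sanity check under the stated configuration (N = 100):

```python
# Two post-change mean configurations with N = 100 and equal total energy:
N = 100
mu_dense = [0.5] * N                   # every entry equal to 0.5
mu_sparse = [1.0] * 25 + [0.0] * 75    # 25% of the entries equal to 1

def energy(mu):
    # Total signal energy: sum over n of mu_n^2.
    return sum(v * v for v in mu)
```

Both configurations give 100 · 0.25 = 25 · 1 = 25, so a first-order EDD formula that depends on µ only through its energy predicts identical delays.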
2) Comparison of EDD: To demonstrate the performance of the time-varying projections A_t defined in (18) with 0-1 entries, we again consider the two cases: the post-change mean vector with [µ]_i = µ_0, and the post-change mean vector with 100p% of its entries [µ]_i equal to one and the other entries equal to zero. The simulated EDDs are shown in Fig. 7. Note that (18) is able to detect the change quickly from a small subset of observations. Even though the EDDs of (18) are larger than those for the fixed Gaussian matrix and the expander graph in Fig. 4 and Fig. 5, this result is meaningful, since in practice, when performing subsampling for dimensionality reduction, we usually scan through the signal by choosing a different subset of signal coordinates to observe at each time.
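The time-varying 0-1 projection can be sketched as repeatedly sampling a fresh subset of M coordinates; the snippet (parameters ours) also illustrates why replacing the per-coordinate observation counts by their mean (t − k)M/N is reasonable:

```python
import random

random.seed(1)

N, M, T = 100, 20, 200   # dimension, coordinates observed per step, steps

# Time-varying 0-1 projection: at each time, observe a fresh random subset
# of M coordinates (each row of A_t contains a single 1).
counts = [0] * N
for _ in range(T):
    for j in random.sample(range(N), M):
        counts[j] += 1

# Each coordinate is observed a Binomial(T, M/N)-distributed number of
# times with mean T*M/N; the approximation replaces the random counts by
# this mean.
expected = T * M / N   # = 40.0
```

When M/N is small, the counts fluctuate more relative to their mean, which matches the observation that the approximation degrades as M/N approaches 0.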

D. Comparison with classical Multivariate CUSUM method
We compare our sketching method with a benchmark based on a classical approach: applying the classic multivariate CUSUM procedure [37] directly to the sketches. By doing so, the benchmark does not incorporate the covariance structure of the sketches. Moreover, since it is based on CUSUM, it requires a prescribed post-change mean vector (set to the all-one vector in our example); hence, its performance may be affected by parameter misspecification. We compare the performance again in two settings: when all [µ]_i are equal to a constant, and when only 100p% of the entries of the post-change mean vector are nonzero. Fig. 8 demonstrates a significant performance gain for our sketching procedure.
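A simplified stand-in for the benchmark (the likelihood-ratio CUSUM for a prescribed shift ν applied to the sketches, not the exact procedure of [37]) is sketched below, together with the effect of misspecifying ν; all names and values are ours:

```python
import random

random.seed(3)

M = 10
nu = [1.0] * M                       # prescribed post-change mean: all-one
nu_norm2 = sum(v * v for v in nu)    # ||nu||^2 = 10

def cusum_alarm_time(ys, b):
    # Likelihood-ratio CUSUM for a prescribed shift nu on i.i.d. N(0, I)
    # sketches: W_t = max(0, W_{t-1} + nu' y_t - ||nu||^2 / 2); alarm when
    # W_t > b.  A simplification of the multivariate CUSUM in [37].
    W = 0.0
    for t, y in enumerate(ys, start=1):
        W = max(0.0, W + sum(v * yi for v, yi in zip(nu, y)) - nu_norm2 / 2.0)
        if W > b:
            return t
    return None

# True per-entry shift 0.8 differs from the prescribed all-one nu: the
# per-step drift drops from the matched 3.2 to nu'delta - ||nu||^2/2 = 3.0,
# so the detector still fires but illustrates sensitivity to misspecification.
ys = [[0.8 + random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(500)]
t_alarm = cusum_alarm_time(ys, b=30.0)
```

When the true shift deviates further from ν (e.g., a sparse shift against an all-one ν), the drift can shrink toward zero, which is the failure mode the GLR-based sketching procedure avoids by estimating the mean.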

E. Solar flare detection
Consider a video from the Solar Dynamics Observatory (SDO), which captures the abrupt emergence of a solar flare.^1 Each frame is of size 232 × 292 pixels, which results in N = 67744-dimensional streaming data. In this video, the normal state consists of slowly drifting solar flares, and the anomaly is a much brighter transient solar flare emerging at t = 223. The true change-point t = 223 is hand-picked. Fig. 9(a) provides a snapshot of the original SDO data at t = 150, when no solar flare exists. Fig. 9(b) provides a snapshot at t = 223, when the solar flare emerges as a brighter curve in the middle of the image; it is not clearly visible to the eye. Before running our detection procedures, we preprocess the data by tracking and removing the slowly changing background with the MOUSSE algorithm, and we detect the change in the residuals (which correspond to x_t in our formulation).
We first verify the Gaussianity of the residuals x_t using the one-sample Kolmogorov-Smirnov (KS) test. The p-value returned by the KS test is 0.47 for the signal at t = 150, which indicates that the Gaussian assumption is reasonable. As additional evidence of Gaussianity, Fig. 10 shows the Quantile-Quantile (Q-Q) plot for the residuals at t = 150.
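The KS check can be reproduced in a few lines. The statistic below is the standard one-sample KS distance to the standard normal CDF, applied here to synthetic residuals rather than the actual MOUSSE output:

```python
import math
import random

random.seed(5)

def normal_cdf(z):
    # Standard normal CDF via the complementary error function.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def ks_statistic(sample):
    # One-sample Kolmogorov-Smirnov distance to the standard normal CDF:
    # the largest gap between the empirical and model CDFs.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        F = normal_cdf(x)
        d = max(d, (i + 1) / n - F, F - i / n)
    return d

# Synthetic "residuals" that really are N(0, 1) should give a small
# statistic, roughly below 1.36 / sqrt(n) at the 5% level.
sample = [random.gauss(0.0, 1.0) for _ in range(500)]
d = ks_statistic(sample)
```

A large p-value (small distance), as observed for the residuals at t = 150, is consistent with the Gaussian model assumed by the detection procedure.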
We apply the sketching procedure with a fixed projection to the MOUSSE residuals, which performs dimensionality reduction. We choose the sketching matrix A to be an M-by-N Gaussian random matrix with entries i.i.d. N(0, 1/N). Note that the signal is deterministic in this case (since we use a deterministic video sequence). To evaluate the consistency of the method, we run the procedure 500 times, each time using a different random Gaussian matrix as the fixed projection A. Fig. 11 shows error bars of the EDDs from the 500 runs. As M increases, both the means and the standard deviations of the EDDs decrease. When M is larger than 750, the EDD is less than 3 with very high probability, which means that our sketching procedure can reliably detect the solar flare with only 750 sketches. This is a significant reduction compared with using the original data, with a dimensionality reduction ratio of 750/67744 ≈ 0.01.

F. Failure detection in power networks

Consider a power network, as shown in Fig. 12.^2 The nodes represent generators, transformers and substations, and the edges represent high-voltage transmission lines between them [38]. Note that the graph is sparse and contains many "communities", which correspond to densely connected subnetworks. In this example, we simulate power failures over this large network. Assume that at each time we may observe the real power injection at an edge. When the power system is in steady state, the observation is the true state plus Gaussian observation noise [39]. We may estimate the true state (e.g., using techniques in [39]), subtract it from the observation vector, and treat the residual vector as our signal x_t, whose entries can be assumed to follow an i.i.d. standard Gaussian distribution. When failures happen in a power system, there will be a shift in the mean for a small number of affected edges, since in practice, when there is a power failure, usually only a small part of the network is affected simultaneously.
To perform the sketching procedure, we consider time-varying matrices A_t with N = 6594 and various M < N. At each time, we randomly choose M nodes in the network and measure the sum of the quantities over all attached edges, as shown in Fig. 13. Note that in this case the projection matrix is neither a Gaussian matrix nor an expander graph; rather, its structure is constrained by the network topology. Our experiments show that the detection procedure performs well with such topology-constrained A_t. Although our example is a highly simplified model of power networks, it sheds some light on the potential of our method for monitoring real power networks. In the following experiment, we set the threshold b such that the ARL is 5000. We assume that the means of 5% of the edges in the network increase by µ_0. Fig. 14(a) shows the simulated EDD versus an increasing signal strength µ_0. Note that the EDD using a small number of sketches is quite small when µ_0 is sufficiently large. For example, when µ_0 = 4, one may detect the change from only M = 100 sketches, which is a significant reduction compared with using the original data, with a ratio of 100/4941 ≈ 0.02.
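Constructing such a topology-constrained projection amounts to selecting rows of the node-edge incidence structure; below is a toy sketch on a hypothetical 5-node graph (the real network has N = 6594 edges):

```python
import random

random.seed(2)

# Hypothetical 5-node graph; edges are (node, node) pairs.  The rows of
# the projection come from the node-edge incidence structure.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (1, 4)]
n_nodes = 5

def topology_projection(chosen_nodes):
    # One row per chosen node; entry e is 1 when edge e touches the node,
    # so each sketch sums the per-edge quantities over incident edges.
    return [[1 if v in e else 0 for e in edges] for v in chosen_nodes]

# Time-varying sketch: pick M random nodes at each time step.
M = 2
A_t = topology_projection(random.sample(range(n_nodes), M))
```

Each row thus has ones exactly at the edges attached to the chosen node, so the number of ones per row equals the node degree rather than being a free design parameter.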
Next, we fix the mean shift of the affected edges at 1 and assume that 100p% of the edges are affected by the change. Fig. 14(b) shows the simulated EDD versus an increasing p. Similarly, the EDD using a small number of sketches is quite small when p is large enough. If p is around 0.4, for example, one can detect the change using only M = 50 sketches.

Fig. 14. Power system example: A is a sensing matrix constrained by the power-network topology. The standard deviation of each point is less than half of its value. (a) EDD versus µ_0 when we randomly select 5% of the edges with mean shift µ_0; (b) EDD versus p when we randomly select 100p% of the edges with mean shift 1. The minimum value of p is 0.05.

VI. SUMMARY AND DISCUSSIONS
In this paper, we studied the problem of sequential change-point detection when the observations are linear projections of high-dimensional signals. We were interested in the setting where the change-point causes an unknown shift in the mean of the signal, and one would like to detect such a change as quickly as possible. We presented new sketching procedures for fixed and time-varying projections, respectively, derived from the generalized likelihood ratio statistic. We analyzed the performance of our procedures by deriving theoretical approximations to the average run length (ARL) when there is no change and the expected detection delay (EDD) when there is a change; these approximations were shown to be highly accurate numerically. We also studied the choice of the projection matrix theoretically, when it is a Gaussian random matrix or a sparse expander graph. Finally, we demonstrated the good performance of our procedure using numerical simulations and two real-data examples on solar flare detection and failure detection in power networks.
Thus far, we have assumed that the data streams are independent. In practice, if the data streams are dependent with a known covariance matrix Σ, we can whiten them by applying the linear transformation Σ^{-1/2} x_t. Otherwise, the covariance matrix Σ can be estimated in a training stage via regularized maximum likelihood methods (see [40] for an overview). Alternatively, we may estimate the covariance matrix of the sketches, AΣA^T or A_tΣA_t^T, directly, which requires fewer samples due to its lower dimensionality. We may then build statistical change-point detection procedures using this covariance estimate (similar to what has been done for the projection Hotelling control chart in [9]), which we leave for future work.
One additional direction is the following: if the signal has certain structure, we may exploit it. For instance, the post-change mean may lie in a low-dimensional subspace, µ = Vβ, for some basis V ∈ R^{N×M} and coefficient vector β ∈ R^M. If we know such a V, we could set A = V, and the ratio EDD(N)/EDD(M) improves accordingly. In general, the ratio EDD(N)/EDD(M) is proportional to O(√(N/M) cos(θ_min)), where θ_min ∈ [0, π/2] is the least principal angle between the subspace in which the signal truly lies and the subspace used to form the projection. Hence, if we know the true subspace to a precision such that cos(θ_min) > √(M/N), the sketching procedure may achieve even better performance. Exploiting knowledge of the subspace has also been considered in [7] in the context of SPC design.
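For one-dimensional subspaces, the least principal angle reduces to a normalized inner product, which the following toy sketch illustrates (the direction vectors are hypothetical):

```python
import math

def cos_principal_angle(u, v):
    # For one-dimensional subspaces, the least principal angle between
    # span(u) and span(v) satisfies cos(theta) = |<u, v>| / (||u|| ||v||).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)

true_dir = [1.0, 0.0, 0.0]   # direction the post-change mean lies along
used_dir = [1.0, 1.0, 0.0]   # direction actually used to build A (imperfect)
c = cos_principal_angle(true_dir, used_dir)   # 1/sqrt(2)
```

For higher-dimensional subspaces, cos(θ_min) is the largest singular value of the product of orthonormal bases of the two subspaces, computed analogously.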

ACKNOWLEDGEMENT
This work is partially supported by NSF grants CCF-1442635 and CMMI-1538746.

APPENDIX
We start by deriving the ARL and EDD for the sketching procedure.
Proofs of Theorem 1 and Theorem 2: The analysis shows that the sketching procedure corresponds to the so-called mixture procedure (cf. T 2 in [5]) in the special case of p_0 = 1, with M sensors and post-change mean vector Vµ. In [5, Theorem 1], it was shown that the ARL of the mixture procedure with parameter p_0 ∈ [0, 1] and M sensors admits a closed-form approximation; specializing that result to p_0 = 1 completes the derivation of the ARL. The EDD can be derived by applying [5, Theorem 2] in the case where ∆ = ‖Vµ‖, the number of sensors is M, and p_0 = 1.
The following proof is for the Gaussian random matrix A.
Proof of Theorem 3: It follows from (26), and a standard result on the distribution function of the beta distribution [41, 26.5.3], that the probability of interest can be expressed in terms of the regularized incomplete beta function (RIBF) I [41, 6.6.2]. We first prove the lower bound in (28). Assuming N → ∞ such that (27) holds, we may combine (38) with [42, Theorem 4.18] to obtain the lower bound in (28). To prove the upper bound, it follows from (38) and a standard property of the RIBF [41, 6.6.3] that, again assuming N → ∞ such that (27) holds, the argument proceeds analogously to that for the lower bound.
The following proofs are for the sparse 0-1 matrix A.
Lemma 3. If a 0-1 matrix A has constant column sum d, then for every non-negative vector x (i.e., [x]_i ≥ 0), we have ‖Ax‖_1 = d‖x‖_1.

Proof of Lemma 3: Summing the entries of Ax and exchanging the order of summation, each [x]_j is counted exactly d times, once for each of the d rows i with A_ij = 1.

Lemma 4 (Bounding σ_max(A)). If A corresponds to an (s, ε)-expander with regular degree d and regular left degree c, then for any nonnegative vector x, ‖Ax‖_2^2 ≤ cd‖x‖_2^2.

Proof of Lemma 4: For any nonnegative vector x, the bound (43) holds since, for a given column j, A_ij = 1 holds for exactly d rows, and each of these d rows contains at most c ones. Writing the singular value decomposition of A with σ_max = σ_max(A), the bound (45) holds since U is a unitary matrix. Thus, in order to bound ∆, we need to characterize σ_max, as well as ‖Aµ‖_2 for an s-sparse vector µ. Combining