Sketching for sequential changepoint detection
EURASIP Journal on Advances in Signal Processing, volume 2019, Article number: 42 (2019)
Abstract
We present sequential changepoint detection procedures based on linear sketches of high-dimensional signal vectors using generalized likelihood ratio (GLR) statistics. The GLR statistics allow for an unknown post-change mean that represents an anomaly or novelty. We consider both fixed and time-varying projections and derive theoretical approximations to two fundamental performance metrics: the average run length (ARL) and the expected detection delay (EDD); these approximations are shown to be highly accurate by numerical simulations. We further characterize the relative performance of the sketching procedure compared to that without sketching and show that there can be little performance loss when the signal strength is sufficiently large and sufficiently many sketches are used. Finally, we demonstrate the good performance of sketching procedures using simulation and real-data examples on solar flare detection and failure detection in power networks.
Introduction
Online changepoint detection from high-dimensional streaming data is a fundamental problem arising from applications such as real-time monitoring of sensor networks, computer network anomaly detection, and computer vision (e.g., [2, 3]). To reduce data dimensionality, a conventional approach is sketching (see, e.g., [4]), which performs random projection of the high-dimensional data vectors into lower-dimensional ones. Sketching is now widely used in signal processing and machine learning to reduce dimensionality and algorithm complexity and achieve various practical benefits [5–11].
We consider changepoint detection using linear sketches of high-dimensional data vectors. Sketching reduces the computational complexity of the detection statistic from \(\mathcal {O}(N)\) to \(\mathcal {O}(M)\), where N is the original dimensionality and M is the dimensionality of sketches. Since we would like to perform real-time detection, any reduction in computational complexity (without incurring much performance loss) is highly desirable. Sketching also offers practical benefits. For instance, for large sensor networks, it reduces the burden of data collection and transmission. It may be impossible to collect data from all sensors and transmit them to a central hub in real time, but this can be done if we only select a small subset of sensors to collect data at each time. Sketching also reduces data storage requirements. For instance, changepoint detection using the generalized likelihood ratio statistic, although robust, is non-recursive. Thus, one has to store historical data. Using sketching, we only need to store the much lower-dimensional sketches rather than the original high-dimensional vectors.
In this paper, we present a new sequential sketching procedure based on the generalized likelihood ratio (GLR) statistics. In particular, suppose we may choose an M×N matrix A with M≪N to linearly project the original data: y_{t}=Ax_{t},t=1,2,…. Assume the pre-change vector is zero-mean Gaussian distributed and the post-change vector is Gaussian distributed with an unknown mean vector μ while the covariance matrix is unchanged. Here, we assume the mean vector is unknown since it typically represents an anomaly. The GLR statistic is formed by replacing the unknown μ with its maximum likelihood estimator (e.g., [12]). Then we further generalize to the setting with time-varying projections A_{t} of dimension M_{t}×N. We demonstrate the good performance of our procedures by simulations, a real data example of solar flare detection, and a synthetic example of power network failure detection with data generated using a real-world power network topology.
Our contribution
Our theoretical contribution is mainly in two aspects. First, we obtain analytic expressions for two fundamental performance metrics for the sketching procedures: the average run length (ARL) when there is no change and the expected detection delay (EDD) when there is a changepoint, for both fixed and time-varying projections. Our approximations are shown to be highly accurate using simulations. These approximations are useful in determining the threshold of the detection procedure to control false alarms, without having to resort to onerous numerical simulations. Second, we characterize the relative performance of the sketching procedure compared to that without sketching. We examine the EDD ratio when the sketching matrix A is either a random Gaussian matrix or a sparse 0-1 matrix (in particular, an expander graph). We find that, as also verified numerically, when the signal strength and M are sufficiently large, the sketching procedure may have little performance loss. When the signal is weak, the performance loss can be large if M is too small. In this case, our results can be used to find the minimum M such that the performance loss is bounded, assuming a certain worst-case signal and a given target ARL value.
To the best of our knowledge, our work is the first to consider sequential changepoint detection from linear sketches using the generalized likelihood ratio statistic, assuming the post-change mean is unknown to represent an anomaly. The only other work [13] that considers changepoint detection using linear projections assumes the post-change mean is known and further assumes it to be sparse. Our results are more general since we do not make such assumptions. Assuming the post-change mean to be unknown provides a more useful procedure since in changepoint detection, the post-change setup is usually unknown. Moreover, [13] considers the Shiryaev-Roberts procedure, which is based on a different kind of detection statistic than the generalized likelihood ratio statistic considered here. The theoretical analyses therein consider slightly different performance measures, namely the probability of false alarm and the average detection delay, and our analyses are completely different.
Our work is also distinct from the existing Statistical Process Control (SPC) charts using random projections (reviewed below in Section 1.3) in that (1) we develop new theoretical results for the sequential GLR statistic, (2) we consider sparse 0-1 and time-varying projections, and (3) we study how much dimensionality reduction can be performed (i.e., the minimum M) such that there is little performance loss.
Notations and outline
Our notations are standard: \({\chi ^{2}_{k}}\) denotes the chi-square distribution with k degrees of freedom; I_{n} denotes an identity matrix of size n; X^{†} denotes the pseudoinverse of a matrix X; [x]_{i} denotes the ith coordinate of a vector x; [X]_{ij} denotes the (i,j)th element of a matrix X; and \(\boldsymbol {x}^{\intercal }\) denotes the transpose of a vector or matrix x.
The rest of the paper is organized as follows. We first review some related work. Section 2 sets up the formulation of the sketching problem for sequential changepoint detection. Section 3 presents the sketching procedure. Section 4 contains the performance analysis of the sketching procedures. Section 5 and Section 6 demonstrate good performance of our sketching procedures using simulation and real-world examples. Section 7 concludes the paper. All proofs are deferred to the appendix.
Related work
In this paper, we use the term “sketching” in a broader sense to mean that our observations are linear projections of the original signals. We are concerned with how to perform sequential changepoint detection using these linear projections. The traditional sketching [4] is concerned with designing linear dimensionality reduction techniques to solve the inverse linear problem Ax=b, where b is of greater dimension than x. This can be cast as a problem of designing a dimensionality reduction (sketching) matrix S such that Sb=SAx is of smaller dimension to reduce computational cost. In our problem, the linear projections can be designed or they can be determined by problem setup (such as missing data or subsampling procedure).
A closely related work is [14], which considers one-dimensional observations, where the pre-change distribution is Gaussian with zero mean and unit variance and the post-change distribution is Gaussian with unknown mean and unit variance. Siegmund and Venkatraman [14] also use the GLR statistic, by estimating the post-change mean and plugging it into the likelihood ratio statistic. Our strategy for deriving the detection statistic is similar to [14]. However, there is one crucial difference. Since the number of linear projections is much smaller than the original dimension, we cannot obtain a unique MLE for the post-change mean vector, but can only determine the equation that the MLE needs to satisfy. Thus, the derivation and the analysis of the GLR detection statistic for our setting are different from [14]. Another closely related work is [15]: we adapt a result therein (Theorem 1) in deriving the ARL and EDD of the sketching procedure when the projection matrix is fixed. The scopes of the two papers are quite different: [15] studies changepoint detection when the post-change mean is sparse, and here, we are not concerned with detecting a sparse change but with detection using linear projections of the original data; moreover, our analysis for the time-varying projection case is new and different from [15].
Changepoint detection problems are closely related to industrial quality control and multivariate statistical process control (SPC) charts, where an observed process is assumed initially to be in control and at a changepoint becomes out of control. The idea of using random projections for change detection has been explored for SPC in the pioneering work [16] based on the U^{2} multivariate control chart, the follow-up work [17] for the cumulative sum (CUSUM) control chart and the exponentially weighted moving average (EWMA) schemes, and in [18, 19] based on the Hotelling statistic. These works provide a complementary perspective from SPC design, while our method takes a different approach and is based on sequential hypothesis testing, treating both the changepoint location and the post-change mean vector as unknown parameters. By treating the changepoint location as an unknown parameter when deriving the detection statistic, the sequential hypothesis testing approach overcomes the drawback of some SPC methods due to a lack of memory, such as the Shewhart chart and the Hotelling chart, since they cannot utilize the information embedded in the entire sequence [20]. Moreover, our sequential GLR statistic may be preferred over the CUSUM procedure in settings where it is difficult to specify the post-change mean vector.
This paper extends our preliminary work reported in [1] in several important ways. We have added (1) time-varying sketching projections and their theoretical analysis, (2) extensive numerical examples to verify our theoretical results, and (3) new real-data examples of solar flare detection and power failure detection.
Our work is related to compressive signal processing [21], where the problem of interest is to estimate or detect (in the fixed-sample setting) a sparse signal using compressive measurements. In [22], an offline test for a nonzero vector buried in Gaussian noise using linear measurements is studied; interestingly, a conclusion similar to ours is drawn that the task of detection within this setting is much easier than the tasks of estimation and support recovery. Another related work is [23], which considers a problem of identifying a subset of data streams within a larger set, where the data streams in the subset follow a distribution (representing anomaly) that is different from the original distribution; the problem considered therein is not a sequential changepoint detection problem as the “changepoint” happens at the onset (t=1). In [24], an offline setting is considered and the goal is to identify k out of n samples whose distributions are different from the normal distribution f_{0}. They use a “temporal” mixing of the samples over the finite time horizon. This is different from our setting since we project over the signal dimension at each time. Other related works include kernel methods [25, 26] that focus on offline changepoint detection. Finally, detecting transient changes in power systems has been studied in [27].
Another common approach to dimensionality reduction is principal component analysis (PCA) [28], which achieves dimensionality reduction by projecting the signal along the singular space of the leading singular values. In this case, A or A_{t} corresponds to the signal singular space. Our theoretical approximation for ARL and EDD can also be applied in these settings. It may not be easy to find the signal singular space when the dimensionality is high, since computing singular value decomposition can be expensive [29].
Formulation
Changepoint detection as sequential hypothesis test
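Consider a sequence of observations with an open time horizon x_{1},x_{2},…,x_{t}, t=1,2,…, where x_{t}∈R^{N} and N is the signal dimension. Initially, the observations are due to noise. There can be an unknown time κ at which a changepoint occurs and changes the mean of the signal vector. Such a problem can be formulated as the following hypothesis test:
\[ \begin{array}{ll} \mathsf{H}_{0}: & \boldsymbol{x}_{t} \sim \mathcal{N}(0, \boldsymbol{I}_{N}), \quad t = 1, 2, \ldots \\ \mathsf{H}_{1}: & \boldsymbol{x}_{t} \sim \mathcal{N}(0, \boldsymbol{I}_{N}), \quad t = 1, 2, \ldots, \kappa, \\ & \boldsymbol{x}_{t} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{I}_{N}), \quad t = \kappa+1, \kappa+2, \ldots \end{array} \tag{1} \]
where \(\boldsymbol{\mu} \in \mathbb{R}^{N}\) is the unknown post-change mean vector.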
Without loss of generality, we have assumed the noise variance is 1. Our goal is to detect the changepoint as soon as possible after it occurs, subject to a false alarm constraint. Here, we assume the covariance of the data to be an identity matrix and the change only happens to the mean.
To reduce data dimensionality, we linearly project each observation x_{t} into a lower-dimensional space, which we refer to as sketching. We aim to develop procedures that can detect a changepoint using the low-dimensional sketches. In the following, we consider two types of linear sketching: the fixed projection and the time-varying projection.
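Note that when the covariance matrix is known, the general problem is equivalent to (1), due to the following simple argument. Suppose we have the following hypothesis test:
\[ \begin{array}{ll} \mathsf{H}_{0}: & \boldsymbol{x}_{t}^{\prime} \sim \mathcal{N}(0, \Sigma), \quad t = 1, 2, \ldots \\ \mathsf{H}_{1}: & \boldsymbol{x}_{t}^{\prime} \sim \mathcal{N}(0, \Sigma), \quad t = 1, 2, \ldots, \kappa, \\ & \boldsymbol{x}_{t}^{\prime} \sim \mathcal{N}(\boldsymbol{\mu}^{\prime}, \Sigma), \quad t = \kappa+1, \kappa+2, \ldots \end{array} \tag{2} \]
where the covariance matrix Σ is positive definite. Denote the eigendecomposition as \(\Sigma = Q\Lambda Q^{\intercal }\). Now, transform each observation \(x_{t}^{\prime }\) using \(x_{t} = \Lambda ^{-1/2} {Q^{\intercal }} x_{t}^{\prime }\), t=1,2,…, where Λ^{−1/2} is a diagonal matrix with the diagonal entries being the inverse of the square root of the diagonal entries of Λ. The transformed observations x_{t} have identity covariance, so the original hypothesis test can be written in the same form as (1) by defining \(\mu =\Lambda ^{-1/2} {Q^{\intercal }} \mu ^{\prime }\).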
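The whitening argument above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using an arbitrary illustrative covariance; the names `Sigma`, `W`, `mu_prime`, and `whitened_cov` are hypothetical and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

# An arbitrary positive-definite covariance, for illustration only.
B = rng.standard_normal((N, N))
Sigma = B @ B.T + N * np.eye(N)

# Eigendecomposition Sigma = Q diag(lam) Q^T.
lam, Q = np.linalg.eigh(Sigma)

# Whitening map x = Lambda^{-1/2} Q^T x'.
W = np.diag(lam ** -0.5) @ Q.T

# The post-change mean transforms the same way: mu = Lambda^{-1/2} Q^T mu'.
mu_prime = rng.standard_normal(N)
mu = W @ mu_prime

# Covariance of the transformed observation x = W x'.
whitened_cov = W @ Sigma @ W.T
```

After the transform, `whitened_cov` equals the identity up to floating-point error, which is why the identity-covariance test (1) loses no generality when Σ is known.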
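Fixed projection. Choose an M×N (possibly random) projection matrix A with M≪N. We obtain low-dimensional sketches via:
\[ \boldsymbol{y}_{t} \triangleq \boldsymbol{A} \boldsymbol{x}_{t}, \quad t = 1, 2, \ldots \tag{3} \]
Then the hypothesis test for the original problem (1) becomes the following hypothesis test based on the sketches (3):
\[ \begin{array}{ll} \mathsf{H}_{0}: & \boldsymbol{y}_{t} \sim \mathcal{N}(0, \boldsymbol{AA}^{\intercal}), \quad t = 1, 2, \ldots \\ \mathsf{H}_{1}: & \boldsymbol{y}_{t} \sim \mathcal{N}(0, \boldsymbol{AA}^{\intercal}), \quad t = 1, 2, \ldots, \kappa, \\ & \boldsymbol{y}_{t} \sim \mathcal{N}(\boldsymbol{A\mu}, \boldsymbol{AA}^{\intercal}), \quad t = \kappa+1, \kappa+2, \ldots \end{array} \tag{4} \]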
Above, the distributions for the sketches are for given projections. Note that both mean and covariance structures are affected by the projections A.
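Time-varying projection. In certain applications, one may use different sketching matrices at each time. The projections are denoted by \(\boldsymbol {A}_{t} \in \mathbb {R}^{M_{t}\times N}\) and the number of sketches M_{t} can change as well. With sketches \(\boldsymbol{y}_{t} = \boldsymbol{A}_{t} \boldsymbol{x}_{t}\), the hypothesis test becomes:
\[ \begin{array}{ll} \mathsf{H}_{0}: & \boldsymbol{y}_{t} \sim \mathcal{N}(0, \boldsymbol{A}_{t}\boldsymbol{A}_{t}^{\intercal}), \quad t = 1, 2, \ldots \\ \mathsf{H}_{1}: & \boldsymbol{y}_{t} \sim \mathcal{N}(0, \boldsymbol{A}_{t}\boldsymbol{A}_{t}^{\intercal}), \quad t = 1, 2, \ldots, \kappa, \\ & \boldsymbol{y}_{t} \sim \mathcal{N}(\boldsymbol{A}_{t}\boldsymbol{\mu}, \boldsymbol{A}_{t}\boldsymbol{A}_{t}^{\intercal}), \quad t = \kappa+1, \kappa+2, \ldots \end{array} \tag{5} \]
Above, the distributions for the sketches are for given projections. Intuitively, in certain settings the time-varying projection is preferred, e.g., when the post-change mean vector μ is sparse and the observations correspond to missing data (i.e., we only observe a subset of entries). One would expect that observing a different subset of entries at each time is better, because with a fixed subset, if the missing locations overlap with the sparse mean-shift locations, then we will miss the signal entirely.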
Sketching matrices
In this paper, we assume that (1) when the sketching matrices A or A_{i} are random, they are full row rank with probability 1; (2) when A or A_{i} are deterministic, they are full row rank. The sketching matrices can either be user-specified or determined by the physical sensing system. Below, we give several examples. Examples (i)–(iv) correspond to situations where the projections are user designed, and example (v) (missing data) corresponds to the situation where the projections are imposed by the setup.

(i) (Dimensionality reduction using random Gaussian matrices). To reduce the dimensionality of a high-dimensional vector (i.e., to compress data), one may use random projections. For instance, one may use random Gaussian matrices \(\boldsymbol {A}\in \mathbb {R}^{M\times N}\) whose entries are i.i.d. Gaussian with zero mean and variance equal to 1/M.

(ii) (Expander graphs). Sketching matrices with {0,1} entries are also commonly used: such a scenario is encountered in environmental monitoring (see, e.g., [15, 30]). Expander graphs are “sparse” 0-1 matrices in the sense that very few entries are nonzero, and thus are desirable for efficient computation since each linear projection only requires summing a few dimensions of the original data vector. Due to good structural properties, they have been used in compressed sensing (e.g., [31]). We will discuss more details about the expander graph in Section 4.4.3.

(iii) (Pairwise comparison). In applications such as social network data analysis and computer vision, we are interested in a pairwise comparison of variables [32, 33]. This can be modeled as observing the difference between a pair of variables, i.e., at each time t, the measurements are [x_{t}]_{i}−[x_{t}]_{j}, for a set of pairs i≠j. There are on the order of N^{2} possible comparisons, and we may randomly select M of them to observe. The pairwise comparison model leads to a structured fixed projection with only {0,1,−1} entries.

(iv) (PCA). There are also approaches to changepoint detection using principal component analysis (PCA) of the data streams (e.g., [28, 34]), which can be viewed as using a deterministic fixed projection A, which is precomputed as the signal singular space associated with the leading singular values of the data covariance matrix.

(v) (Missing data). In various applications, we may only observe a subset of entries at each time (e.g., due to sensor failure), and the locations of the observed entries also vary with time [35]. This corresponds to \(\boldsymbol {A}_{t} \in \mathbb {R}^{M_{t} \times N}\) being a submatrix of an identity matrix obtained by selecting rows from an index set Ω_{t} at time t. When the data are missing at random, the indicators of which entries are observed at time t are i.i.d. Bernoulli random variables.
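Three of the sketching matrices listed above can be constructed in a few lines. This is a minimal sketch assuming NumPy; the helper names `pairwise_matrix` and `selection_matrix` and the particular pairs and sizes are hypothetical choices for illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 5

# (i) Gaussian sketch: i.i.d. entries with zero mean and variance 1/M.
A_gauss = rng.standard_normal((M, N)) / np.sqrt(M)

# (iii) Pairwise comparison: row r observes [x]_i - [x]_j for the r-th pair.
def pairwise_matrix(pairs, n):
    A = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        A[row, i], A[row, j] = 1.0, -1.0
    return A

# A chain of pairs keeps the rows linearly independent; randomly drawn
# pairs can form cycles, which would violate the full-row-rank assumption.
pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
A_pair = pairwise_matrix(pairs, N)

# (v) Missing data: rows of the identity indexed by the observed set Omega_t.
def selection_matrix(observed, n):
    return np.eye(n)[np.sort(np.asarray(observed))]

omega_t = rng.choice(N, size=M, replace=False)
A_miss = selection_matrix(omega_t, N)

# Sketch one observation with each matrix.
x_t = rng.standard_normal(N)
y_gauss, y_pair, y_miss = A_gauss @ x_t, A_pair @ x_t, A_miss @ x_t
```

Here `y_miss` simply picks out the observed coordinates of `x_t`, and `y_pair[0]` equals `x_t[0] - x_t[1]`.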
Methods: sketching procedures
Below, we derive the sketching procedure when the projection matrices are fixed (across all times) and time-varying, respectively. In both cases, the MLE of the post-change mean vector cannot be uniquely determined in general. We tackle this issue and derive different generalized likelihood ratio (GLR) detection statistics and provide different analyses of the detection performance in the two cases.
Sketching procedure: fixed projection
Derivation of GLR statistic
We now derive the likelihood ratio statistic for the hypothesis test in (4). The strategy for deriving the GLR statistic in this case (with the fixed projection) is similar to [14]. However, [14] only considers the univariate case, where the MLE of the post-change mean can be obtained explicitly. Here, we consider the multidimensional case, and since the number of linear projections is much smaller than the original dimension, we cannot obtain a unique MLE for the post-change mean vector, but can only determine the equation that the MLE needs to satisfy; we therefore need a different derivation to obtain the GLR detection statistic.
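Define the sample mean within a window [k,t]
\[ \bar{\boldsymbol{y}}_{t,k} = \frac{\sum_{i=k+1}^{t} \boldsymbol{y}_{i}}{t-k}. \tag{6} \]
Since the observations are i.i.d. over time, for an assumed changepoint κ=k, for the hypothesis test in (4), the log-likelihood ratio of observations accumulated up to time t>k, given the projection matrix A, becomes
\[ \ell(t,k,\boldsymbol{\mu}) = \sum_{i=k+1}^{t} \log \frac{f_{1}(\boldsymbol{y}_{i})}{f_{0}(\boldsymbol{y}_{i})} = \sum_{i=k+1}^{t} \left[ \boldsymbol{y}_{i}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A\mu} - \frac{1}{2} \boldsymbol{\mu}^{\intercal} \boldsymbol{A}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A\mu} \right], \tag{7} \]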
where \(f_{0} = \mathcal {N}(0, \boldsymbol {AA}^{\intercal })\) denotes the probability density function of data under the null and \(f_{1}=\mathcal {N}(\boldsymbol {A{\mu }, AA}^{\intercal })\) denotes the probability density function of y_{i} under the alternative. Note that since A is full row rank (with probability 1), \(\boldsymbol {AA}^{\intercal }\) is invertible (with probability 1).
Since μ is unknown, the GLR statistic replaces it with a maximum likelihood estimator (MLE) for fixed values of k and t in the likelihood ratio (7) to obtain the log-GLR statistic. Taking the derivative of ℓ(t,k,μ) in (7) with respect to μ and setting it to 0, we obtain an equation that the maximum likelihood estimator μ^{∗} of the post-change mean vector needs to satisfy:
\[ \boldsymbol{A}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A} \boldsymbol{\mu}^{*} = \boldsymbol{A}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k}, \]
or equivalently \(\boldsymbol {A}^{\intercal }\left [(\boldsymbol {AA}^{\intercal })^{-1}\boldsymbol {A}\boldsymbol {\mu }^{*}-(\boldsymbol {AA}^{\intercal })^{-1}\bar {\boldsymbol {y}}_{t, k}\right ]=0.\) Note that since A is of dimension M-by-N with M<N, this defines an underdetermined system of equations for the maximum likelihood estimator μ^{∗}. In other words, any μ^{∗} that satisfies
\[ (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A} \boldsymbol{\mu}^{*} = (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k} + \boldsymbol{c} \tag{8} \]
for a vector \(\boldsymbol {c}\in \mathbb {R}^{M}\) that lies in the null space of \(\boldsymbol {A}^{\intercal }\), \(\boldsymbol {A}^{\intercal } \boldsymbol {c} = 0,\) is a maximum likelihood estimator for the post-change mean. In this case, we could use the pseudoinverse to solve for μ^{∗}, but we choose not to do this as the resulting detection statistic is too complex to analyze. Rather, we choose a special solution by setting c=0, which leads to a simple detection statistic and tractable theoretical analysis. Then, the corresponding maximum likelihood estimator satisfies the equation below:
\[ (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A} \boldsymbol{\mu}^{*} = (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k}. \tag{9} \]
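Now substitute such a μ^{∗} into (7). Using (9), the first and second terms in (7) become, respectively,
\[ \sum_{i=k+1}^{t} \boldsymbol{y}_{i}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A} \boldsymbol{\mu}^{*} = (t-k)\, \bar{\boldsymbol{y}}_{t,k}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k}, \qquad \sum_{i=k+1}^{t} \frac{1}{2} \boldsymbol{\mu}^{*\intercal} \boldsymbol{A}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \boldsymbol{A} \boldsymbol{\mu}^{*} = \frac{t-k}{2}\, \bar{\boldsymbol{y}}_{t,k}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k}. \]
Combining the above, from (7), we have that the log-GLR statistic is given by
\[ \ell(t,k,\boldsymbol{\mu}^{*}) = \frac{t-k}{2}\, \bar{\boldsymbol{y}}_{t,k}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k}. \tag{10} \]
Since the changepoint location k is unknown, when forming the detection statistic, we take the maximum over a set of possible locations, i.e., the most recent samples from t−w to t, where w>0 is the window size. Now we define the sketching procedure, which is a stopping time that stops whenever the log-GLR statistic rises above a threshold b>0:
\[ T = \inf \left\{ t : \max_{t-w \leq k < t} \frac{t-k}{2}\, \bar{\boldsymbol{y}}_{t,k}^{\intercal} (\boldsymbol{AA}^{\intercal})^{-1} \bar{\boldsymbol{y}}_{t,k} > b \right\}. \tag{11} \]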
Here, the role of the window is twofold: it reduces the data storage when implementing the detection procedure and it establishes a minimum level of change that we want to detect.
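The windowed stopping rule can be sketched in code. The following is a minimal illustration, assuming the log-GLR statistic for a candidate changepoint k takes the quadratic form (t−k)/2 · ȳᵀ(AAᵀ)⁻¹ȳ that results from setting c=0; the function name `glr_stopping_time` and all parameter values (dimensions, shift size, threshold, window) are hypothetical choices, not values from the paper.

```python
import numpy as np

def glr_stopping_time(Y, A, b, w):
    """First time t (1-indexed) at which the windowed log-GLR statistic exceeds b.

    Y has shape (T, M); row t-1 holds the sketch y_t. Returns None if no alarm.
    """
    G_inv = np.linalg.inv(A @ A.T)             # (A A^T)^{-1}, requires full row rank
    T, M = Y.shape
    # csum[t] = y_1 + ... + y_t, so window sums are differences of two rows.
    csum = np.vstack([np.zeros(M), np.cumsum(Y, axis=0)])
    for t in range(1, T + 1):
        for k in range(max(0, t - w), t):      # candidate changepoints t-w <= k < t
            s = csum[t] - csum[k]              # sum of y_{k+1}, ..., y_t
            # Equals (t-k)/2 * ybar^T G_inv ybar with ybar = s / (t-k).
            if s @ G_inv @ s / (2.0 * (t - k)) > b:
                return t
    return None

# Illustrative run: Gaussian sketch, mean shift of 1.0 in every coordinate after kappa.
rng = np.random.default_rng(2)
N, M, kappa, T = 50, 10, 100, 200
A = rng.standard_normal((M, N)) / np.sqrt(M)
X = rng.standard_normal((T, N))
X[kappa:] += 1.0                               # rows kappa, ... hold x_{kappa+1}, ...
Y = X @ A.T                                    # sketches y_t = A x_t
tau = glr_stopping_time(Y, A, b=30.0, w=50)
```

Only the last w sketches (or their cumulative sums) need to be stored, which is the storage saving the window provides. The double loop costs O(w M²) per time step; it is written for clarity rather than speed.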
Equivalent formulation of fixed projection sketching procedure
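We can further simplify the log-GLR statistic in (10) using the singular value decomposition (SVD) of A. This will facilitate the performance analysis in Section 4 and lead to some insights about the structure of the log-GLR statistic. Let the SVD of A be given by
\[ \boldsymbol{A} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^{\intercal}, \tag{12} \]
where \(\boldsymbol {U}\in \mathbb {R}^{M\times M}\), \(\boldsymbol {V}\in \mathbb {R}^{N\times M}\) contain the left and right singular vectors, and \(\boldsymbol {\Sigma } \in \mathbb {R}^{M\times M}\) is a diagonal matrix containing all nonzero singular values. Then \((\boldsymbol {A}\boldsymbol {A}^{\intercal })^{-1} = \boldsymbol {U}\boldsymbol {\Sigma }^{-2}\boldsymbol {U}^{\intercal }\). Thus, we can write the log-GLR statistic (10) as
\[ \ell(t,k,\boldsymbol{\mu}^{*}) = \frac{t-k}{2}\, \bar{\boldsymbol{y}}_{t,k}^{\intercal} \boldsymbol{U} \boldsymbol{\Sigma}^{-2} \boldsymbol{U}^{\intercal} \bar{\boldsymbol{y}}_{t,k}. \tag{13} \]
Substitution of the sample average (6) into (13) results in
\[ \ell(t,k,\boldsymbol{\mu}^{*}) = \frac{\left\| \boldsymbol{\Sigma}^{-1} \boldsymbol{U}^{\intercal} \sum_{i=k+1}^{t} \boldsymbol{y}_{i} \right\|^{2}}{2(t-k)}. \]
Now define transformed data
\[ \boldsymbol{z}_{i} \triangleq \boldsymbol{\Sigma}^{-1} \boldsymbol{U}^{\intercal} \boldsymbol{y}_{i}. \]
Since under the null hypothesis \(\boldsymbol {y}_{i} \mid \boldsymbol {A} \sim \mathcal {N}(0, \boldsymbol {AA}^{\intercal })\), we have \(\boldsymbol {z}_{i} \sim \mathcal {N}(0, \boldsymbol {I}_{M})\). Similarly, under the alternative hypothesis \(\boldsymbol {y}_{i} \mid \boldsymbol {A} \sim \mathcal {N}(\boldsymbol {A{\mu }}, \boldsymbol {AA}^{\intercal })\), we have \(\boldsymbol {z}_{i} \sim \mathcal {N}(\boldsymbol {V}^{\intercal } \boldsymbol {\mu }, \boldsymbol {I}_{M})\). Combining the above, we obtain the following equivalent form for the sketching procedure in (11):
\[ T = \inf \left\{ t : \max_{t-w \leq k < t} \frac{\left\| \sum_{i=k+1}^{t} \boldsymbol{z}_{i} \right\|^{2}}{2(t-k)} > b \right\}. \tag{14} \]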
This form has an intuitive explanation: the sketching procedure essentially projects the data to form M (less than N) independent data streams and then forms a log-GLR statistic for these independent data streams.
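This equivalence can be checked numerically. The sketch below assumes the transformed data are z_i = Σ⁻¹Uᵀy_i, which is consistent with the distributions of z_i stated above, and compares the quadratic form in (AAᵀ)⁻¹ against the squared norm of the summed z_i; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, n = 4, 12, 7                       # n plays the role of t - k

A = rng.standard_normal((M, N))
X = rng.standard_normal((n, N))
Y = X @ A.T                              # n sketches y_i = A x_i, one per row

# Thin SVD: U is M x M, sing holds the M singular values, Vt is M x N.
U, sing, Vt = np.linalg.svd(A, full_matrices=False)

# Direct form: (t-k)/2 * ybar^T (A A^T)^{-1} ybar.
y_bar = Y.mean(axis=0)
stat_direct = 0.5 * n * y_bar @ np.linalg.inv(A @ A.T) @ y_bar

# Transformed form: row i of Z is z_i^T = (Sigma^{-1} U^T y_i)^T.
Z = (Y @ U) / sing
stat_transformed = np.sum(Z.sum(axis=0) ** 2) / (2.0 * n)
```

The two statistics agree up to rounding, and `Z` coincides with `X @ Vt.T`, i.e., each z_i is the projection Vᵀx_i of the original observation onto the row space of A.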
Sketching procedure: time-varying projection
GLR statistic
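Similarly, we can derive the log-likelihood ratio statistic for the time-varying projections. For an assumed changepoint κ=k, using all observations from k+1 to time t, we find the log-likelihood ratio statistic similar to (7):
\[ \ell(t,k,\boldsymbol{\mu}) = \sum_{i=k+1}^{t} \left[ \boldsymbol{y}_{i}^{\intercal} (\boldsymbol{A}_{i}\boldsymbol{A}_{i}^{\intercal})^{-1} \boldsymbol{A}_{i} \boldsymbol{\mu} - \frac{1}{2} \boldsymbol{\mu}^{\intercal} \boldsymbol{A}_{i}^{\intercal} (\boldsymbol{A}_{i}\boldsymbol{A}_{i}^{\intercal})^{-1} \boldsymbol{A}_{i} \boldsymbol{\mu} \right]. \tag{15} \]
Similarly, we replace the unknown post-change mean vector μ by its maximum likelihood estimator using data in [k+1,t]. Taking the derivative of ℓ(t,k,μ) in (15) with respect to μ and setting it to 0, we obtain an equation that the maximum likelihood estimator μ^{∗} needs to satisfy
\[ \left[ \sum_{i=k+1}^{t} \boldsymbol{A}_{i}^{\intercal} (\boldsymbol{A}_{i}\boldsymbol{A}_{i}^{\intercal})^{-1} \boldsymbol{A}_{i} \right] \boldsymbol{\mu}^{*} = \sum_{i=k+1}^{t} \boldsymbol{A}_{i}^{\intercal} (\boldsymbol{A}_{i}\boldsymbol{A}_{i}^{\intercal})^{-1} \boldsymbol{y}_{i}. \tag{16} \]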
Note that in the case of time-varying projection, we no longer have the structure in (8) for the fixed projection. Thus, in this case, we will use a different strategy to derive the detection statistic based on the pseudoinverse. One needs to discuss the rank of the matrix \(\sum _{i=k+1}^{t} \boldsymbol {A}_{i}^{\intercal }\left (\boldsymbol {A}_{i}\boldsymbol {A}_{i}^{\intercal }\right)^{-1}\boldsymbol {A}_{i}\) on the left-hand side of (16). Define the SVD of \(\boldsymbol {A}_{i} = \boldsymbol {U}_{i} \boldsymbol {D}_{i} \boldsymbol {V}_{i}^{\intercal }\) with \(\boldsymbol {U}_{i} \in \mathbb {R}^{M_{i} \times M_{i}}\) and \(\boldsymbol {V}_{i} \in \mathbb {R}^{N \times M_{i}}\) containing the left and right singular vectors and \(\boldsymbol {D}_{i} \in \mathbb {R}^{M_{i} \times M_{i}}\) being a diagonal matrix that contains all the singular values. We have that
\[ \sum_{i=k+1}^{t} \boldsymbol{A}_{i}^{\intercal} (\boldsymbol{A}_{i}\boldsymbol{A}_{i}^{\intercal})^{-1} \boldsymbol{A}_{i} = \sum_{i=k+1}^{t} \boldsymbol{V}_{i} \boldsymbol{V}_{i}^{\intercal} = \boldsymbol{Q}\boldsymbol{Q}^{\intercal}, \tag{17} \]
where \(\boldsymbol {Q}=[\boldsymbol {V}_{k+1}, \ldots, \boldsymbol {V}_{t}] \in \mathbb {R}^{N \times S}\) and \(S = \sum _{i=k+1}^{t} M_{i}\). Since the rank of \(\boldsymbol{Q}\boldsymbol{Q}^{\intercal}\) equals the rank of \(\boldsymbol{Q}\), which is at most min{N,S}, consider the rank of \(\sum _{i=k+1}^{t} \boldsymbol {A}_{i}^{\intercal }\left (\boldsymbol {A}_{i}\boldsymbol {A}_{i}^{\intercal }\right)^{-1}\boldsymbol {A}_{i}\) for the two cases below: (i) when S<N, the rank is at most S and the matrix is rank deficient; (ii) when S≥N, the matrix can have full rank N, depending on the projections.
From above, we can see that this matrix is rank deficient when t−k<N/M, i.e., the number of postchange samples t−k is small. However, this is generally the case since we want to detect the change quickly once it occurs. Since the matrix in (17) is noninvertible in general, we use the pseudoinverse of the matrix. Define
From (16), we obtain an estimate of the maximum likelihood estimator for the postchange mean
Substituting such a μ^{∗} into (15), we obtain the log-GLR statistic for time-varying projection:
Time-varying 0-1 projection matrices
To further simplify the expression of the GLR in (18), we consider a special case when A_{t} has only one entry equal to 1 in each row and all other entries equal to 0. This means that at each time, we only observe a subset of the entries, which can correspond to the missing data case. Now \(\boldsymbol {A}_{t} \boldsymbol {A}_{t}^{\intercal }\) is an M_{t}-by-M_{t} identity matrix, and \(\boldsymbol {A}_{t}^{\intercal } \boldsymbol {A}_{t}\) is a diagonal matrix. For a diagonal matrix \(\boldsymbol {D}\in \mathbb {R}^{N\times N}\) with diagonal entries λ_{1},…,λ_{N}, the pseudoinverse of D is a diagonal matrix with diagonal entries \(\lambda _{i}^{-1}\) if λ_{i}≠0 and with diagonal entries 0 if λ_{i}=0. Let the index set of the observed entries at time t be Ω_{t}. Define indicator variables
Then, the log-GLR statistic in (18) becomes
Hence, for 0-1 matrices, the sketching procedure based on the log-GLR statistic is given by
where b>0 is the prescribed threshold and w is the window length. Note that the log-GLR statistic essentially computes the sum of each entry within the time window [t−w,t) and then averages the squared sum.
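The description above (sum each observed entry over the window, square, and normalize by the number of times that entry was observed) can be sketched as follows. This is an illustration under stated assumptions: the boolean array `mask` plays the role of the indicator variables, the general statistic is assumed to be the quadratic form ½ uᵀP†u with u = Σ A_iᵀ(A_iA_iᵀ)⁻¹y_i and P = Σ A_iᵀ(A_iA_iᵀ)⁻¹A_i (obtained by substituting the pseudoinverse solution into the log-likelihood ratio), and the function name `glr_missing_data` is hypothetical.

```python
import numpy as np

def glr_missing_data(X, mask, k, t):
    """Log-GLR for 0-1 selection matrices using samples k+1, ..., t (1-indexed).

    X, mask have shape (T, N); mask[i, n] is True when entry n is observed
    at time i+1. Coordinates never observed in the window contribute zero.
    """
    obs = mask[k:t]                                   # rows for times k+1, ..., t
    sums = np.where(obs, X[k:t], 0.0).sum(axis=0)     # per-coordinate sum of observed entries
    counts = obs.sum(axis=0)                          # times each coordinate was observed
    seen = counts > 0
    return 0.5 * np.sum(sums[seen] ** 2 / counts[seen])

rng = np.random.default_rng(4)
T, N = 6, 5
X = rng.standard_normal((T, N))
mask = rng.random((T, N)) < 0.5
mask[:, 0] = False                # coordinate 0 is never observed
stat = glr_missing_data(X, mask, k=1, t=6)

# Cross-check against the general pseudoinverse form with explicit A_i.
u, P = np.zeros(N), np.zeros((N, N))
for i in range(1, 6):
    A_i = np.eye(N)[mask[i]]      # rows of the identity for the observed entries
    y_i = A_i @ X[i]
    u += A_i.T @ y_i              # A_i A_i^T is the identity here
    P += A_i.T @ A_i              # diagonal matrix of observation counts
stat_general = 0.5 * u @ np.linalg.pinv(P) @ u
```

The two values agree, and no matrix inversion is needed in the 0-1 case: the statistic only requires running sums and counts per coordinate.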
Results: Theoretical
In this section, we present theoretical approximations to two performance metrics, the average run length (ARL), which captures the false alarm rate, and the expected detection delay (EDD), which captures the power of the detection statistic.
Performance metrics
We first introduce some necessary notations. Under the null hypothesis in (1), the observations are zero mean. Denote the probability and expectation in this case by \(\mathbb {P}^{\infty }\) and \(\mathbb {E}^{\infty }\), respectively. Under the alternative hypothesis, there exists a changepoint κ, 0≤κ<∞ such that the observations have mean μ for all t>κ. Probability and expectation in this case are denoted by \(\mathbb {P}^{\kappa }\) and \(\mathbb {E}^{\kappa }\), respectively.
The choice of the threshold b involves a tradeoff between two standard performance metrics that are commonly used for analyzing changepoint detection procedures [15]: (i) the ARL, defined to be the expected value of the stopping time when there is no change, and (ii) the EDD, defined to be the expected stopping time in the extreme case where a change occurs immediately at κ=0, which is denoted as \(\mathbb {E}^{0}\{T\}\).
The following argument from [14] explains why we consider \(\mathbb {E}^{0}\{T\}\). When there is a change at κ, we are interested in the expected delay until its detection, i.e., the conditional expectation \(\mathbb {E}^{\kappa }\{T-\kappa \mid T > \kappa \}\), which is a function of κ. When the shift in the mean only occurs in the positive direction [μ]_{i}≥0, it can be shown that \(\sup _{\kappa } \mathbb {E}^{\kappa }\{T-\kappa \mid T > \kappa \} = \mathbb {E}^{0}\{T\}\). It is not obvious that this remains true when [μ]_{i} can be either positive or negative. However, since \(\mathbb {E}^{0}\{T\}\) is certainly of interest and reasonably easy to analyze, it is common to consider \(\mathbb {E}^{0}\{T\}\) in the literature and we also adopt this as a surrogate.
Fixed projection
Define a special function (cf. [36], page 82)
where Φ denotes the cumulative distribution function (CDF) of the standard Gaussian with zero mean and unit variance. For numerical purposes, a simple and accurate approximation is given by (cf. [37])
where ϕ denotes the probability density function (PDF) of the standard Gaussian. We obtain an approximation to the ARL of the sketching procedure with a fixed projection as follows:
Theorem 1
[ARL, fixed projection]
Assume that 1≤M≤N, b→∞ with M→∞ and b/M fixed. Then, with w=o(b^{r}) for some positive integer r, for a given projection matrix A that is full rank deterministically or with probability 1, the ARL of the sketching procedure defined in (11) is given by
where
This theorem gives an explicit expression for the ARL as a function of the threshold b, the dimension of the sketches M, and the window length w. As we will show below, the approximation to the ARL given by Theorem 1 is highly accurate. On a higher level, this theorem characterizes the mean of the stopping time when the detection statistic is driven by noise. The requirement that w=o(b^{r}) for some positive integer r comes from [15], on which our results are based; this ensures the correct scaling when we pass to the limit. This essentially requires that the window length be large enough when the threshold b increases. In practice, w has to be large enough so that it does not cause a missed detection: w has to be longer than the anticipated expected detection delay, as explained in [15].
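The special function introduced before Theorem 1 is not displayed in the text above. As an assumption, the sketch below takes it to be the function ν(·) from sequential analysis, ν(x) = (2/x²) exp{−2 Σ_{n≥1} n⁻¹ Φ(−x√n/2)}, together with the commonly used closed-form approximation ν(x) ≈ (2/x)[Φ(x/2) − 0.5] / [(x/2)Φ(x/2) + ϕ(x/2)]. Both formulas and the function names `nu_series` and `nu_approx` are assumptions for illustration and should be checked against [36, 37].

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def nu_series(x, terms=5000):
    """Assumed series definition of nu(x) for x > 0, truncated after `terms` terms."""
    s = sum(std_normal_cdf(-0.5 * x * math.sqrt(n)) / n for n in range(1, terms + 1))
    return 2.0 / (x * x) * math.exp(-2.0 * s)

def nu_approx(x):
    """Assumed closed-form approximation to nu(x) for x > 0."""
    h = 0.5 * x
    return (2.0 / x) * (std_normal_cdf(h) - 0.5) / (h * std_normal_cdf(h) + std_normal_pdf(h))

# Compare the two on a few arguments.
comparison = {x: (nu_series(x), nu_approx(x)) for x in (0.5, 1.0, 2.0)}
```

The closed form avoids the infinite series, which is convenient when the function sits inside a numerical integral that must be re-evaluated many times, for example during a threshold search.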
Moreover, we obtain an approximation to the EDD of the sketching procedure with a fixed projection as follows. Define
where V contains the right singular vectors of A. Let \(\tilde {S}_{t}\triangleq \sum _{i=1}^{t}\delta _{i}\) be a random walk where the increments δ_{i} are independent and identically Gaussian distributed with mean Δ^{2}/2 and variance Δ^{2}. The expected value of its minimum is given by [37]
where (x)^{−}=− min{x,0}, and the infinite series converges and can be evaluated easily numerically. Also, define
Theorem 2
[EDD, fixed projection] Suppose b→∞ with other parameters held fixed. Then, for a given matrix A with the right singular vectors V, the EDD of the sketching procedure (11) when κ=0 is given by
The theorem finds an explicit expression for the EDD as a function of threshold b, the number of sketches M, and the signal strength captured through Δ which depends on the post mean vector μ and the projection subspace V.
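The expected minimum of the random walk that enters the EDD approximation can be evaluated numerically. The sketch below assumes the infinite series is the standard identity E{min_{t≥0} S̃_t} = −Σ_{i≥1} i⁻¹ E{(S̃_i)⁻}, and uses the fact that S̃_i is Gaussian with mean iΔ²/2 and variance iΔ², so that E{(S̃_i)⁻} has a closed form. The function name `expected_min_random_walk` and the truncation rule are hypothetical.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expected_min_random_walk(delta, tol=1e-12, max_terms=1_000_000):
    """E[min_{t>=0} S_t] for i.i.d. N(delta^2/2, delta^2) increments (assumed series)."""
    total = 0.0
    for i in range(1, max_terms + 1):
        m = 0.5 * i * delta ** 2            # mean of S_i
        s = delta * math.sqrt(i)            # standard deviation of S_i
        z = m / s
        # For X ~ N(m, s^2): E[(X)^-] = s * pdf(m/s) - m * cdf(-m/s).
        neg_part = s * std_normal_pdf(z) - m * std_normal_cdf(-z)
        term = neg_part / i
        total += term
        if term < tol:                      # terms decrease monotonically in i
            break
    return -total

values = {d: expected_min_random_walk(d) for d in (0.5, 1.0, 3.0)}
```

The result is negative and approaches 0 as Δ grows, reflecting that a strong signal makes the random walk unlikely to dip below its starting point.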
The proofs of the above two theorems utilize the equivalent form of T in (14) and connect the sketching procedure to the so-called mixture procedure (cf. T_{2} in [15]) when M sensors are affected by the change and the postchange mean vector is given by \(\boldsymbol {V}^{\intercal } \boldsymbol {\mu }\).
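The expected minimum of the random walk \(\tilde {S}_{t}\) that enters Theorem 2 can be evaluated numerically in a few lines. The sketch below assumes that the series referred to above has the form given by Spitzer's identity, \(\mathbb {E}\{\min _{t\geq 0}\tilde {S}_{t}\} = -\sum _{i\geq 1} i^{-1}\mathbb {E}\{(\tilde {S}_{i})^{-}\}\), with \(\tilde {S}_{i}\sim \mathcal {N}(i\Delta ^{2}/2, i\Delta ^{2})\); the function names and the truncation tolerance are our own illustrative choices.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def expected_negative_part(mean, std):
    """E[(X)^-] with (x)^- = -min(x, 0), for X ~ N(mean, std^2)."""
    z = mean / std
    return std * normal_pdf(z) - mean * normal_cdf(-z)


def expected_min_random_walk(delta, tol=1e-12, max_terms=1_000_000):
    """Evaluate -sum_{i>=1} E[(S_i)^-] / i for Gaussian increments.

    Assumption: increments are N(delta^2 / 2, delta^2), so that
    S_i ~ N(i * delta^2 / 2, i * delta^2). The series is truncated once
    a term falls below `tol`.
    """
    total = 0.0
    for i in range(1, max_terms + 1):
        term = expected_negative_part(i * delta ** 2 / 2.0,
                                      delta * math.sqrt(i)) / i
        total += term
        if term < tol:
            break
    return -total


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0):
        print(d, expected_min_random_walk(d))
```

The terms decay geometrically in i, so the truncation is reached quickly unless Δ is very small.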
Accuracy of theoretical approximations
Consider A generated as a Gaussian random matrix, with entries i.i.d. \(\mathcal {N}(0, 1/N)\). Using the expression in Theorem 1, we can find the threshold b such that the corresponding ARL is equal to 5000. This can be done conveniently: since the ARL is an increasing function of the threshold b, we can use bisection to find such a threshold b. Then, we compare it with a threshold b found from simulation.
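The bisection step can be sketched as follows. The routine only assumes that the ARL is an increasing function of b on the search interval; the function `toy_arl` is a hypothetical stand-in used to exercise the routine and is not the expression in Theorem 1, which would be passed in its place.

```python
import math


def find_threshold(arl, target, lo, hi, tol=1e-8, max_iter=200):
    """Bisection for b with arl(b) = target, assuming arl is increasing on [lo, hi]."""
    if not (arl(lo) <= target <= arl(hi)):
        raise ValueError("target ARL is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if arl(mid) < target:
            lo = mid   # ARL too small: the threshold must be larger
        else:
            hi = mid   # ARL large enough: the threshold can be smaller
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def toy_arl(b):
    """Hypothetical increasing ARL curve, NOT the Theorem 1 approximation."""
    return math.exp(b / 2.0)


b_star = find_threshold(toy_arl, target=5000.0, lo=0.0, hi=50.0)

if __name__ == "__main__":
    print(b_star)
```

Because each iteration halves the bracket, a few dozen evaluations of the ARL expression suffice, which is far cheaper than calibrating b by simulation.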
As shown in Table 1, the threshold found using Theorem 1 is very close to that obtained from simulation. Therefore, even though the theoretical ARL approximation is derived for N tending to infinity, it is still applicable when N is large but finite. Theorem 1 is quite useful in determining a threshold for a targeted ARL, as simulations for large N and M can be quite time-consuming, especially for a large ARL (e.g., 5000 or 10,000).
Moreover, we simulate the EDD for detecting a signal such that the postchange mean vector μ has all entries equal to a constant [μ]_{i}=0.5. As also shown in Table 1, the approximation for EDD using Theorem 2 is quite accurate.
We have also verified that the theoretical approximations are accurate for expander graphs; the details are omitted here since they are similar.
Consequence
Theorems 1 and 2 have the following consequences:
Remark 1
For a fixed large ARL, when M increases, the ratio M/b is bounded between 0.5 and 2. This property is quite useful for establishing the results in Section 4.4. It is demonstrated numerically in Fig. 1 for N=100 and w=200, with the ARL fixed at 5000 and the corresponding threshold b found using Theorem 1 as M increases from 10 to 100. More precisely, Theorem 1 leads to the following corollary:
Corollary 1
Assume a large constant γ∈(e^{5},e^{20}). Let w≥100. For any large enough M>24.85, the threshold b such that the corresponding ARL \(\mathbb {E}^{\infty }\{T\} =\gamma \) satisfies M/b∈(0.5,2). In other words, max{M/b,b/M}≤2.
Note that e^{20} is on the order of 5×10^{8}; hence, the ARL can be very large. However, it is still bounded above, which means that the corollary holds in a nonasymptotic regime.
Remark 2
As b→∞, the first-order approximation to the EDD in Theorem 2 is given by b/(Δ^{2}/2), i.e., the threshold b divided by the Kullback-Leibler (KL) divergence (see, e.g., [15], which shows that Δ^{2}/2 is the KL divergence between \(\mathcal {N}(0, \boldsymbol {I}_{M})\) and \(\mathcal {N}(\boldsymbol {V}^{\intercal } \boldsymbol {\mu }, \boldsymbol {I}_{M})\)). This is consistent with our intuition, since the expected increment of the detection statistic is roughly the KL divergence of the test. For finite b, especially when the signal strength is weak and the number of sketches M is not large enough, the terms other than b/(Δ^{2}/2) will play a significant role in determining the EDD.
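As an illustration of Remark 2, the following sketch computes \(\Delta = \|\boldsymbol {V}^{\intercal } \boldsymbol {\mu }\|\) for one draw of a Gaussian A and evaluates the first-order term b/(Δ^{2}/2). The values N=100, M=20, [μ]_{i}=0.5, the threshold b=30, and the random seed are hypothetical choices made only for this example; in practice b would be calibrated from Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 20                      # hypothetical dimensions
A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))
mu = np.full(N, 0.5)                # postchange mean with equal entries

# Thin SVD: the rows of Vt are the right singular vectors of A (Vt is M x N).
_, _, Vt = np.linalg.svd(A, full_matrices=False)

delta = np.linalg.norm(Vt @ mu)     # Delta = ||V^T mu||
kl = delta ** 2 / 2.0               # KL divergence after sketching


def first_order_edd(b, delta):
    """Leading term b / (Delta^2 / 2) of the EDD approximation."""
    return b / (delta ** 2 / 2.0)


b = 30.0                            # hypothetical threshold
if __name__ == "__main__":
    print(delta, kl, first_order_edd(b, delta))
```

Since V has orthonormal columns, Δ can never exceed ∥μ∥, so the first-order EDD with sketching is never smaller than b/(∥μ∥^{2}/2) for the same b.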
Time-varying 0-1 random projection matrices
Below, we obtain approximations to ARL and EDD for T_{{0,1}}, i.e., when 0-1 sketching matrices are used. We assume a fixed dimension M_{t}=M,∀t>0. We also assume that at each time t, we randomly select M out of N dimensions to observe. Hence, at each time, each signal dimension has a probability
\(r = M/N \in (0,1)\)
to be observed. The sampling scheme is illustrated in Fig. 2, when N=10 and M=3 (the number of dots in each column is 3) over 17 consecutive time periods from time k=t−17 to time t.
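The sampling scheme can be sketched as follows: at each time, M of the N coordinates are drawn uniformly without replacement, which corresponds to a 0-1 matrix A_{t} with a single "1" in each row. The helper names and the number of simulated time steps are illustrative assumptions; the empirical observation frequency of each coordinate should be close to M/N.

```python
import random


def sample_observed_indices(n_dims, n_obs, rng):
    """Pick n_obs of the n_dims coordinates uniformly without replacement."""
    return sorted(rng.sample(range(n_dims), n_obs))


def selection_matrix(n_dims, observed):
    """0-1 matrix with one row per observed coordinate and a single 1 per row."""
    return [[1 if j == idx else 0 for j in range(n_dims)] for idx in observed]


rng = random.Random(0)
N, M, T = 10, 3, 20000              # N and M as in Fig. 2; T is our choice
counts = [0] * N
for _ in range(T):
    for n in sample_observed_indices(N, M, rng):
        counts[n] += 1
freqs = [c / T for c in counts]     # empirical probability of being observed

if __name__ == "__main__":
    print(freqs)
```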
For such a sampling scheme, we have the following result:
Theorem 3
[ARL, time-varying 0-1 random projection] Let r=M/N. Let b^{′}=b/r. When b→∞, for the procedure defined in (21), we have that
where c(N,b^{′},w) is defined by replacing b with b^{′} in (23).
Moreover, we can obtain an approximation to the EDD of T_{{0,1}}, as justified by the following arguments. First, relax the deterministic constraint that at each time we observe exactly M out of N entries. Instead, assume a random sampling scheme in which, at each time i, we observe each entry [x_{i}]_{n}, 1≤n≤N, independently with probability r. Consider i.i.d. Bernoulli random variables ξ_{ni} with parameter r for 1≤n≤N and i≥1. Define
Based on this, we define a procedure whose behavior is arguably similar to T_{{0,1}}:
where b>0 is the prescribed threshold. Then, using the arguments in Appendix 2, we can show that the approximation to EDD of this procedure is given by
and we use this to approximate the EDD of T_{{0,1}}.
Table 2 shows the accuracy of the approximations for ARL in (26) and for EDD in (27) for various values of M when N=100, w=200, and all entries [μ]_{i}=0.5. The results show that both the thresholds b obtained using the theoretical approximations and the EDD approximations are very accurate.
Bounding relative performance
In this section, we characterize the relative performance of the sketching procedure compared to that without sketching (i.e., using the original log-GLR statistic). We show that the performance loss due to sketching can be small when the signal-to-noise ratio and M are both sufficiently large. In the following, we focus on fixed projection to illustrate this point.
Relative performance metric
We consider a relative performance measure, which is the ratio of EDD using the original data (denoted as EDD(N), which corresponds to A=I), versus the EDD using the sketches (denoted as EDD(M))
We will show that this ratio depends critically on the following quantity
which is the ratio of the KL divergence after and before the sketching.
We start by deriving the relative performance measure using theoretical approximations we obtained in the last section. Recall the expression for EDD approximation in (25). Define
From Theorem 2, we obtain that the EDD of the sketching procedure is proportional to
Let b_{N} and b_{M} be the thresholds such that the corresponding ARLs are 5000, for the procedure without sketching and with M sketches, respectively. Define Q_{M}=M/b_{M}, Q_{N}=N/b_{N} and
Using the definitions above, we have
Discussion of factors in (31)
We can show that P≥1 for sufficiently large M and large signal strength. This can be verified numerically, since all quantities that P depends on can be computed explicitly: the thresholds b_{N} and b_{M} can be found from Theorem 1 once we set a target ARL, and the h function can be evaluated using (29), which depends explicitly on Δ and M. Figure 3 shows the value of P when N=100 and all the entries of the postchange mean vector [μ]_{i} are equal to a constant value that varies along the x-axis. Note that P is less than 1 only when the signal strengths [μ]_{i} are small and M is small. Thus, we have,
for sufficiently large M and signal strength Δ.
Using Corollary 1, we have that Q_{M}∈(0.5,2) and Q_{N}∈(0.5,2), and hence, a lower bound of the ratio EDD(N)/EDD(M) is between (1/4)(N/M)Γ and 4(N/M)Γ, for large M or large signal strength.
Next, we will bound Γ when A is a Gaussian matrix and an expander graph, respectively.
Bounding Γ
Gaussian matrix. Consider \(\boldsymbol {A}\in \mathbb {R}^{M\times N}\) whose entries are i.i.d. Gaussian with zero mean and variance equal to 1/M. First, we have the following lemma
Lemma 1
[38] Let \(\boldsymbol {A}\in \mathbb {R}^{M\times N}\) have i.i.d. \(\mathcal {N}(0, 1)\) entries. Then, for any fixed vector μ, we have that
More related results can be found in [39]. Since the Beta(α,β) distribution has a mean α/(α+β), we have that
We may also show that, provided M and N grow proportionally, Γ converges to its mean value at a rate exponential in N. Define δ∈(0,1) to be
We have the following result.
Theorem 4
[Gaussian A] Let \(\boldsymbol {A}\in \mathbb {R}^{M\times N}\) have entries i.i.d. \(\mathcal {N}(0, 1)\). Let N→∞ such that (33) holds. Then, for 0<ε< min(δ,1−δ), we have that
Hence, for Gaussian A, Γ is approximately M/N with probability 1.
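A quick Monte Carlo check of this concentration is sketched below, taking Γ=∥V^{⊺}μ∥^{2}/∥μ∥^{2} with V the right singular vectors of A. The dimensions, the number of trials, and the seed are hypothetical choices for illustration only.

```python
import numpy as np


def gamma_ratio(A, mu):
    """Gamma = ||V^T mu||^2 / ||mu||^2, with V the right singular vectors of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return float(np.sum((Vt @ mu) ** 2) / np.sum(mu ** 2))


rng = np.random.default_rng(1)
N, M, trials = 100, 20, 300         # hypothetical sizes
mu = np.full(N, 0.5)                # any fixed nonzero vector would do

# One independent Gaussian sketching matrix per trial.
samples = np.array([gamma_ratio(rng.standard_normal((M, N)), mu)
                    for _ in range(trials)])
mean_gamma = samples.mean()

if __name__ == "__main__":
    print(mean_gamma, M / N, samples.std())
```

The spread of the samples around M/N shrinks as N grows with M/N held fixed, in line with Theorem 4.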
Note that Theorem 4 is different from the restricted isometry property (RIP) invoked in compressed sensing, since here we assume one fixed and given vector μ, but in compressed sensing, one cares about norm preservation uniformly for all sparse vectors (with the same sparsity level) with probability 1.
Expander graph A. We can show that for expander graphs, Γ is also bounded. This holds for “one-sided” changes, i.e., when the postchange mean vector is elementwise positive.
A matrix A corresponds to an (s,ε)-expander graph with regular right degree d if and only if each column of A has exactly d “1”s, and for any set S of right nodes with |S|≤s, the set of neighbors \(\mathcal {N}(S)\) of the left nodes has size \(|\mathcal {N}(S)|\geq (1-\epsilon) d |S|\). If it further holds that each row of A has c “1”s, we say A corresponds to an (s,ε)-expander with regular right degree d and regular left degree c.
Assume [μ]_{i}≥0 for all i. Let \(\boldsymbol {A} \in \mathbb {R}^{M\times N}\) consist of binary entries, so that it corresponds to a bipartite graph, as illustrated in Fig. 4. We further consider a bipartite graph with regular left degree c (i.e., the number of edges from each variable node is c) and regular right degree d (i.e., the number of edges from each parity check node is d). Hence, this requires Nc=Md. Expander graphs satisfy the above requirements. The existence of expander graphs is established in [40]:
Lemma 2
[40] For any fixed ε>0 and \(\rho \triangleq M/N <1\), when N is sufficiently large, there always exists an (αN,ε) expander with a regular right degree d and a regular left degree c for some constants α∈(0,1), d and c.
Theorem 5
[Expander A] If A corresponds to an (s,ε)-expander with regular right degree d and regular left degree c, then for any nonnegative vector μ with [μ]_{i}≥0, we have that
Hence, for expander graphs, Γ is approximately greater than M/N·(1/d), where d is a small number.
Consequence
Combining the results above, we have shown that in the regime where M and the signal strength are sufficiently large, the performance loss can be small (as indeed observed in our numerical examples). In this regime, when A is a Gaussian random matrix, the relative performance measure EDD(N)/EDD(M) is a constant, under the conditions in Corollary 1. Moreover, when A is a sparse 0-1 matrix with d nonzero entries on each row (in particular, an expander graph), the ratio (31) EDD(N)/EDD(M) is lower bounded by (1/4)·(1−ε)/d for some small number ε>0, when Corollary 1 holds.
There is one intuitive explanation. Unlike in compressed sensing, where the goal is to recover a sparse signal and one needs the projection to preserve the norm up to a factor through the restricted isometry property (RIP) [41], our goal is to detect a nonzero vector in Gaussian noise, which is a much simpler task than compressed sensing. Hence, even though the projection reduces the norm of the vector, detection remains possible as long as the projection does not diminish the signal norm below the noise floor.
On the other hand, when the signal is weak and M is not large enough, there can be significant performance loss (as indeed observed in our numerical examples), and we cannot lower bound the relative performance measure. Fortunately, in this regime, we can use our theoretical results in Theorems 1 and 2 to design the number of sketches M for an anticipated worst-case signal strength Δ, or determine the infeasibility of the problem, i.e., it is better not to use sketching since the signal is too weak.
Results: numerical examples
In this section, we present numerical examples to demonstrate the performance of the sketching procedure. We focus on comparing the sketching procedure with the GLR procedure without sketching (by letting A=I in the sketching procedure). We also compare the sketching procedures with a standard multivariate CUSUM using sketches.
In the subsequent examples, we select the ARL to be 5000 to represent a low false detection rate (a similar choice has been made in other sequential changepoint detection work such as [15]). In practice, however, the target ARL value depends on how frequently we can tolerate false detections (e.g., once a month or once a year). Below, EDD_{o} denotes the EDD when A=I (i.e., no sketching is used). All simulated results are obtained from 10^{4} repetitions. We also consider the minimum number of sketches
such that the corresponding sketching procedure is only δ samples slower than the full procedure. Below, we focus on the delay loss δ=1.
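Given a table of EDD values for a grid of M, the minimum number of sketches can be read off as sketched below. The code assumes the reading "EDD(M) is at most EDD_{o}+δ"; the function name is ours, and the numbers in the example are placeholders used only to exercise the function, not results from our experiments.

```python
def min_sketches(edd_by_m, edd_full, delta=1.0):
    """Smallest M whose EDD is within `delta` samples of the full-data EDD.

    Assumed criterion: EDD(M) <= EDD_o + delta. Returns None when no M on
    the grid meets the criterion.
    """
    feasible = [m for m, edd in edd_by_m.items() if edd <= edd_full + delta]
    return min(feasible) if feasible else None


# Placeholder values, NOT simulation results.
example = {50: 9.0, 100: 5.5, 200: 4.2, 300: 3.9}
m_min = min_sketches(example, edd_full=3.5, delta=1.0)

if __name__ == "__main__":
    print(m_min)
```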
Fixed projection, Gaussian random matrix
First, consider Gaussian A with N=500 and different numbers of sketches M<N.
EDD versus signal magnitude
Assume the postchange mean vector has entries with equal magnitude, [μ]_{i}=μ_{0}, to simplify our discussion. Figure 5a shows EDD versus an increasing signal magnitude μ_{0}. Note that when μ_{0} and M are sufficiently large, the sketching procedure can approach the performance of the procedure using the full data, as predicted by our theory. When the signal is weak, we have to use a much larger M to prevent a significant performance loss (and when the signal is too weak, we cannot use sketching). Table 3 shows M_{min} for each signal strength; we find that when μ_{0} is sufficiently large, we may even use M_{min} less than 30 for N=500 with little performance loss. Note that here, we do not require signals to be sparse.
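A small-scale version of this experiment can be sketched as follows. We assume that the sketching procedure takes the standard window-limited GLR form for a Gaussian mean shift applied to the whitened sketches z_{t}=V^{⊺}x_{t}, i.e., it stops when max over t−w≤k<t of ∥Σ_{i=k+1}^{t} z_{i}∥^{2}/(2(t−k)) exceeds b. The sizes N=50, M=10, the threshold b=20 (not calibrated to any ARL), the window, and the number of repetitions are hypothetical choices that keep the run short.

```python
import numpy as np


def detection_delay(Vt, mu, b, w, rng, t_max=500):
    """Stopping time of the windowed GLR rule when the change occurs at time 0."""
    M, N = Vt.shape
    cums = np.zeros((t_max + 1, M))          # cums[t] = z_1 + ... + z_t
    for t in range(1, t_max + 1):
        x = mu + rng.standard_normal(N)      # postchange observation
        cums[t] = cums[t - 1] + Vt @ x       # whitened sketch z_t = V^T x_t
        k = np.arange(max(0, t - w), t)      # candidate changepoints in the window
        diffs = cums[t] - cums[k]            # partial sums over (k, t]
        stats = np.sum(diffs ** 2, axis=1) / (2.0 * (t - k))
        if stats.max() > b:
            return t
    return t_max                             # no alarm within the horizon


def simulated_edd(mu0, N=50, M=10, b=20.0, w=50, reps=50, seed=0):
    """Average delay for a postchange mean with all entries equal to mu0."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(M, N))
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    mu = np.full(N, mu0)
    return float(np.mean([detection_delay(Vt, mu, b, w, rng)
                          for _ in range(reps)]))


if __name__ == "__main__":
    for mu0 in (0.5, 1.0):
        print(mu0, simulated_edd(mu0))
```

With the same sketching matrix, doubling μ_{0} quadruples Δ^{2}, so the simulated delay should drop markedly, consistent with the first-order term b/(Δ^{2}/2).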
EDD versus signal sparsity
Now assume that the postchange mean vector is sparse: only 100p% of the entries [μ]_{i} are 1 and the other entries are 0. Figure 5b shows EDD versus an increasing p. Note that as p increases, the signal strength also increases; thus, the sketching procedure will approach the performance using the full data. Similarly, the M_{min} required is listed in Table 4. For example, when p=0.5, we find that one can use M_{min}=100 for N=500 with little performance loss.
Fixed projection, expander graph
Now assume A is an expander graph with N=500 and different numbers of sketches M<N. We run the simulations with the same settings as those in Section 5.1.
EDD versus signal magnitude
Assume the postchange mean vector [μ]_{i}=μ_{0}. Figure 6a shows EDD with an increasing μ_{0}. Note that the simulated EDDs are smaller than those for the Gaussian random projections in Fig. 5. A possible reason is that the expander graph is better at aggregating the signals when [μ]_{i} are all positive. However, when [μ]_{i} can be either positive or negative, the two choices of A have similar performance, as shown in Fig. 7, where [μ]_{i} are drawn i.i.d. uniformly from [ − 3,3].
EDD versus signal sparsity
Assume that the postchange mean vector has only 100p% of its entries [μ]_{i} equal to 1 and the other entries equal to 0. Figure 6b shows the simulated EDD versus an increasing p. As p tends to 1, the sketching procedure approaches the performance using the full data.
Time-varying projections with 0-1 matrices
To demonstrate the performance of the procedure T_{{0,1}} (21) using time-varying projections with 0-1 entries, we again consider two cases: the postchange mean vector has [μ]_{i}=μ_{0}, and the postchange mean vector has 100p% of its entries [μ]_{i} equal to 1 and the other entries equal to 0. The simulated EDDs are shown in Fig. 8. Note that T_{{0,1}} can detect the change quickly with a small subset of observations. Although the EDDs of T_{{0,1}} are larger than those for the fixed projections in Figs. 5 and 6, this example shows that projections with 0-1 entries can have little performance loss in some cases and remain a viable candidate, since such projections imply a simpler measurement scheme.
Comparison with multivariate CUSUM
We compare our sketching method with a benchmark adapted from the conventional multivariate CUSUM procedure [42] for the sketches. A main difference is that in multivariate CUSUM, one needs a prescribed postchange mean vector (which is set to be an all-one vector in our example), rather than estimating it as the GLR statistic does. Hence, its performance may be affected by parameter misspecification. We compare the performance again in two settings: when all [μ]_{i} are equal to a constant and when 100p% of the entries of the postchange mean vector are positive valued. In Fig. 9, the log-GLR-based sketching procedure performs much better than the multivariate CUSUM.
Examples for real applications
Solar flare detection
We use our method to detect a solar flare in a video sequence from the Solar Dynamics Observatory (SDO)^{1}. Each frame is of size 232×292 pixels, which results in an ambient dimension N=67,744. In this example, the normal frames are slowly drifting background sun surfaces, and the anomaly is a much brighter transient solar flare that emerges at t=223. Figure 10a is a snapshot of the original SDO data at t=150 before the solar flare emerges, and Fig. 10b is a snapshot at t=223 when the solar flare emerges as a brighter curve in the middle of the image. We preprocess the data by tracking and removing the slowly changing background with the MOUSSE algorithm [43] to obtain tracking residuals. The Gaussianity of the residuals, which correspond to our x_{t}, is verified by the Kolmogorov-Smirnov test. For instance, the p value is 0.47 for the signal at t=150, which indicates that Gaussianity is a reasonable assumption.
We apply the sketching procedure with fixed projection to the MOUSSE residuals, choosing the sketching matrix A to be an M-by-N Gaussian random matrix with entries i.i.d. \(\mathcal {N}(0,1/N)\). Note that the signal is deterministic in this case. To evaluate our method, we run the procedure 500 times, each time using a different random Gaussian matrix as the fixed projection A. Figure 11 shows the error bars of the EDDs from 500 runs. As M increases, both the means and standard deviations of the EDDs decrease. When M is larger than 750, EDD is often less than 3, which means that our sketching detection procedure can reliably detect the solar flare with only 750 sketches. This is a significant reduction, and the dimensionality reduction ratio is 750/67,744≈0.01.
Changepoint detection for power systems
Finally, we present a synthetic example based on a real power network topology. We consider the Western States Power Grid of the USA, which consists of 4941 nodes and 6594 edges. The minimum degree of a node in the network is 1, as shown in Fig. 12. The nodes represent generators, transformers, and substations, and edges represent high-voltage transmission lines between them [44]. Note that the graph is sparse and that there are many “communities” which correspond to densely connected subnetworks.
In this example, we simulate a situation for power failure over this large network. Assume that at each time, we may observe the real power injection at an edge. When the power system is in a steady state, the observation is the true state plus Gaussian observation noise [45]. We may estimate the true state (e.g., using techniques in [45]), subtract it from the observation vector, and treat the residual vector as our signal x_{i}, which can be assumed to be i.i.d. standard Gaussian. When a failure happens in a power system, there will be a shift in the mean for a small number of affected edges, since in practice, when there is a power failure, usually only a small part of the network is affected simultaneously.
To perform sketching, at each time, we randomly choose M nodes in the network and measure the sum of the quantities over all attached edges, as shown in Fig. 13. This corresponds to A_{t}'s with N=6594 and various M<N. Note that in this case, our projection matrix is a 0-1 matrix whose structure is constrained by the network topology. Our example is a simplified model for power networks and aims to shed some light on the potential of our method applied to monitoring real power networks.
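The construction of such a topology-constrained A_{t} can be sketched on a toy graph: the row for a chosen node has a "1" in the column of every edge attached to that node, so that each sketch is the sum of the edge quantities at the node. The 5-node graph, the edge values, and the helper names below are hypothetical and are not the Western States Power Grid data.

```python
import random


def node_edge_sketch_matrix(edges, chosen_nodes):
    """Row for node v has a 1 in column e when edge e is attached to v, else 0."""
    return [[1 if v in edge else 0 for edge in edges] for v in chosen_nodes]


def sketch(A_t, x):
    """Compute y = A_t x for a 0-1 matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A_t]


# Toy 5-node, 6-edge graph (hypothetical).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
n_nodes, M = 5, 2

rng = random.Random(0)
chosen = rng.sample(range(n_nodes), M)       # nodes observed at this time step
A_t = node_edge_sketch_matrix(edges, chosen)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]           # one value per edge
y = sketch(A_t, x)                           # one aggregated value per chosen node

if __name__ == "__main__":
    print(chosen, y)
```

Each column of the full node-edge matrix contains exactly two "1"s, since every edge is attached to two nodes.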
In the following experiment, we assume that on average, 5% of the edges in the network increase by μ_{0}. Set the threshold b such that the ARL is 5000. Figure 14 shows the simulated EDD versus an increasing signal strength μ_{0}. Note that the EDD from using a small number of sketches is quite small if μ_{0} is sufficiently large. For example, when μ_{0}=4, one may detect the change by observing from only M=100 sketches (when the EDD is increased only by one sample), which is a significant dimensionality reduction with a ratio of 100/4941≈0.02.
Conclusion
In this paper, we studied the problem of sequential changepoint detection when the observations are linear projections of the highdimensional signals. The changepoint causes an unknown shift in the mean of the signal, and one would like to detect such a change as quickly as possible. We presented new sketching procedures for fixed and timevarying projections, respectively. Sketching is used to reduce the dimensionality of the signal and thus computational complexity; it also reduces data collection and transmission burdens for large systems.
The sketching procedures were derived based on the generalized likelihood ratio statistic. We analyzed the theoretical performance of our procedures by deriving approximations to the average run length (ARL) when there is no change, and the expected detection delay (EDD) when there is a change. Our approximations were shown to be highly accurate numerically and were used to understand the effect of sketching.
We also characterized the relative performance of the sketching procedure compared to that without sketching. We specifically studied the relative performance measure for fixed Gaussian random projections and expander graph projections. Our analysis and numerical examples showed that the performance loss due to sketching could be quite small in a big regime when the signal strength and the dimension of sketches M are sufficiently large. Our result can also be used to find the minimum required M given a worst-case signal and a target ARL. In other words, we can determine the region where sketching results in little performance loss. We demonstrated the good performance of our procedure using numerical simulations and two real-world examples for solar flare detection and failure detection in power networks.
On a high level, although the Kullback-Leibler (KL) divergence becomes smaller after sketching, the threshold b for the same ARL also becomes smaller. For instance, for a Gaussian matrix, the reduction in KL divergence is compensated by the reduction of the threshold b for the same ARL, because the factors by which they are reduced are roughly the same. This leads to the somewhat counterintuitive result that the EDDs with and without sketching turn out to be similar in this big regime.
Thus far, we have assumed that the data streams are independent. In practice, if the data streams are dependent with a known covariance matrix Σ, we can whiten the data streams by applying a linear transformation Σ^{−1/2}x_{t}. Otherwise, the covariance matrix Σ can also be estimated using a training stage via regularized maximum likelihood methods (see [46] for an overview). Alternatively, we may estimate the covariance matrix Σ^{′} of the sketches, \(\boldsymbol {A}\boldsymbol {\Sigma } \boldsymbol {A}^{\intercal }\) or \(\boldsymbol {A}_{t}\boldsymbol {\Sigma } \boldsymbol {A}_{t}^{\intercal }\), directly, which requires fewer samples due to the lower dimensionality of the covariance matrix. Then, we can build statistical changepoint detection procedures using Σ^{′} (similar to what has been done for the projection Hotelling control chart in [19]), which we leave for future work. Another direction of future work is to accelerate the computation of sketching using techniques such as those in [47].
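The whitening step for a known Σ can be sketched as follows, using the symmetric inverse square root obtained from an eigendecomposition. The 2×2 covariance matrix is a hypothetical example; the check confirms that the whitened data have identity covariance.

```python
import numpy as np


def inverse_sqrt(Sigma):
    """Symmetric inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    # Scale each eigenvector (column) by 1 / sqrt(eigenvalue), then recombine.
    return (vecs / np.sqrt(vals)) @ vecs.T


Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])               # hypothetical known covariance
W = inverse_sqrt(Sigma)                      # W = Sigma^{-1/2}

# Covariance of W x_t when x_t has covariance Sigma.
whitened_cov = W @ Sigma @ W.T

if __name__ == "__main__":
    print(whitened_cov)
```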
Appendix 1: Proofs
We start by deriving the ARL and EDD for the sketching procedure.
Proofs for Theorems 1 and 2
This analysis demonstrates that the sketching procedure corresponds to the so-called mixture procedure (cf. T_{2} in [15]) in the special case of p_{0}=1, M sensors, and postchange mean vector \(\boldsymbol {V}^{\intercal } \boldsymbol {\mu }\). In [15], Theorem 1, it was shown that the ARL of the mixture procedure with parameter p_{0}∈[0,1] and M sensors is given by
where the detection statistic will search within a time window m_{0}≤t−k≤m_{1}. Let \(g(x,p_{0}) = \log (1-p_{0} + p_{0}e^{x^{2}/2})\). Then, \(\psi (\theta) = \log \mathbb {E}\{e^{\theta g(U, p_{0})}\}\) is the log moment generating function (MGF) for \(g(U, p_{0}), U\sim \mathcal {N}(0, 1)\), θ_{0} is the solution to \(\dot {\psi }(\theta) = b/M\),
and
Note that U^{2} is \({\chi ^{2}_{1}}\) distributed, whose MGF is given by \(\mathbb {E}\left \{e^{\theta U^{2}}\right \}=1/\sqrt {1-2\theta }\). Hence, when p_{0}=1, we have g(U,1)=U^{2}/2 and
\[\psi (\theta) = \log \mathbb {E}\left \{e^{\theta U^{2}/2}\right \} = -\frac {1}{2}\log (1-\theta).\]
The first-order and second-order derivatives of the log MGF are given by, respectively,
\[\dot {\psi }(\theta) = \frac {1}{2(1-\theta)}, \qquad \ddot {\psi }(\theta) = \frac {1}{2(1-\theta)^{2}}.\]
Set \(\dot {\psi }(\theta _{0}) = b/M\). We obtain 1−θ_{0}=M/(2b), i.e., θ_{0}=1−M/(2b). Hence, \(\ddot {\psi }(\theta _{0}) = 2b^{2}/M^{2}\). We have g(x,1)=x^{2}/2, and \(\dot {g}(x, 1) = x\).
where
Combining the above, we have that the ARL of the sketching procedure is given by
Next, using the fact that 1/(1−θ_{0})=2b/M, we have that the two terms in the above expression can be written as
then (39) becomes
Finally, note that we can also write
and the constant is
We are done deriving the ARL. The EDD can be derived by applying Theorem 2 of [15] in the case where \(\Delta = \|\boldsymbol {V}^{\intercal }\boldsymbol {\mu }\|\), the number of sensors is M, and p_{0}=1. □
The following proof is for the Gaussian random matrix A.
Proof of Theorem 4
It follows from (32), and a standard result concerning the distribution function of the beta distribution ([48], 26.5.3) that
where I is the regularized incomplete beta function (RIBF) ([48], 6.6.2). We first prove the lower bound in (34). Assuming N→∞ such that (33) holds, we may combine (41) with ([49], Theorem 4.18) to obtain
from which it follows that there exists \(\tilde {N}\) such that for all \(N\geq \tilde {N}\),
which rearranges to give
which proves the lower bound in (34). To prove the upper bound, it follows from (41) and a standard property of the RIBF ([48], 6.6.3) that
Assuming N→∞ such that (33) holds, we may combine (42) with ([49], Theorem 4.18) to obtain
and the argument now proceeds analogously to that for the lower bound. □
Lemma 3
If a 01 matrix A has constant column sum d, for every nonnegative vector x such that [ x]_{i}≥0, we have
Proof of Lemma 3
Below, A_{ij}=[A]_{ij}.
□
Lemma 4
[Bounding σ_{max}(A)] If A corresponds to an (s,ε)-expander with regular right degree d and regular left degree c, for any nonnegative vector x,
thus,
Proof of Lemma 4
For any nonnegative vector x,
Above, (46) holds since for a given column j, A_{ij}=1 holds for exactly d rows. And for each row i of these d rows, A_{il}=1 for exactly c columns with l∈{1,…,p}; (47) holds since dN=Mc. Finally, from the definition of σ_{max}, (45) holds. □
Proof for Theorem 5
Note that
where σ_{max}=σ_{max}(A), and (48) holds since U is a unitary matrix. Thus, in order to bound Δ, we need to characterize σ_{max}, as well as ∥Aμ∥_{2} for an s-sparse vector μ. Combining (48) with Lemmas 3 and 4, we have that for every nonnegative vector μ, [μ]_{i}≥0,
Finally, Lemma 2 characterizes the quantity [M(1−ε)/(dN)]^{1/2} in (49) and establishes the existence of such an expander graph. When A corresponds to an (αN,ε) expander described in Lemma 2, Δ≥∥βμ∥_{2} for all nonnegative signals [μ]_{i}≥0 for some constant α and some constant β=(ρ(1−ε)/d)^{1/2}. Done. □
Proof for Corollary 1
Define \(x \triangleq M/b\). Then Theorem 1 tells us that when M goes to infinity, we have that
where
and
Define that \(\gamma \triangleq \mathbb {E}^{\infty }\{T\}\). One claim is that when M>24.85 and γ∈[e^{5},e^{20}], there exists one x^{∗}∈(0.5,2) such that (50) holds. Next, we prove the claim.
Define the logarithm of the righthand side of (50) as follows:
Since ν(u)→1 as u→0 and \(\nu (u) \rightarrow \frac {2}{u^{2}}\) as u→∞, we know that \(\int _{0}^{\infty } u\nu ^{2}(u)du\) exists. From the numerical integration, we know that \(\int _{0}^{\infty } u\nu ^{2}(u)du <1\). Therefore, − log(C(M,x,w))>0. Then,
When M>24.85, we have that p(0.5)>20. Then, when γ<e^{20}, we have that p(0.5)− logγ>0.
Next, we prove that we can find some x_{0}∈(0.5,2) such that p(x_{0})− logγ<0 provided that γ>e^{5}. Since \(\phi \left (\frac {u}{2} \right) < 0.5\) and
for any u>0. We have that
Then, we have that for any u>0,
We define that x_{0} is the solution to the following equation:
Then, we have that
where the second inequality is due to the fact that the upper bound for the integral interval is 1 and the third inequality is due to the fact that exp(−x) is a convex function. Therefore, we have that
Note that the upper bound above for − logC(M,x_{0},w) does not depend on M, because we choose an x_{0} that depends on M. Solving Eq. (52), we have that
and x_{0}→2 as M→∞. By Taylor’s expansion, we have that x_{0}=2−2M^{−1/2}+M^{−1}+o(M^{−1}), or x_{0}=2−2M^{−1/2}+o(M^{−1/2}). Then, we have that
and
and
Combining the above results, we have that
One important observation is that the righthand side of (53) converges as M→∞. In fact, p(x_{0}) as a function of M is decreasing and converges as M→∞. Since we set w≥100, then for any M>24.85, p(x_{0})<5. Therefore, for any γ>e^{5} and any M>24.85, we can find a x_{0} close to 2 such that p(x_{0})− logγ<0.
Since p(x) is a continuous function, there exists a solution x^{∗}∈(0.5,2) such that Eq. (50) holds. □
Proof of Theorem 3
The proof uses a similar argument as that in [15].
By the law of large numbers, when t−k tends to infinity, the following sum converges in probability
Moreover, from the central limit theorem,
So by the continuous mapping theorem,
i.e., the squared and scaled version of the sum is asymptotically a \({\chi ^{2}_{1}}\) random variable with one degree of freedom. By Slutsky’s theorem, combining (54) and (56),
Using Lemma 1 in [50], for \(X\sim {\chi ^{2}_{1}}\),
Therefore, with probability at least 1−2e^{−ε}, the difference is bounded by a constant
On the other hand, by the central limit theorem, when t−k tends to infinity,
and by the law of large numbers and the continuous mapping theorem
Hence, invoking Slutsky’s theorem again, we have
Hence, combining the above, by a triangle inequality type of argument, we may conclude that, with high probability, the difference is bounded by a constant c
Hence, to control the ARL for the procedure defined in (21)
one can approximately consider another procedure
with
This corresponds to the special case of the mixture procedure with N sensors and all being affected by the change (p_{0}=1), except that the threshold is scaled by 1/r. Hence, we can use the ARL approximation for mixture procedure, which leads to (26). □
Appendix 2: Justification for EDD of (27)
Proof
Below, let \(T=T^{\prime }_{\{0,1\}}\) for simplicity. Define \(S_{n,t} = \sum _{i=1}^{t} [\boldsymbol {x}_{i}]_{n} \xi _{ni}\) for any n and t. To obtain an EDD approximation to \(T^{\prime }_{\{0,1\}}\), first we note that
Then, we can leverage a similar proof as that to Theorem 2 in [15] to obtain that as b→∞,
The first term on the right-hand side of (59) is equal to \(\mathbb {E}^{0} \left \{T^{\prime }_{\{0, 1\}}\right \} \cdot r\sum _{n=1}^{N} {\mu _{n}^{2}} /2\). Using the fact that random variables \(([x_{i}]_{n} \xi _{ni} - r\mu _{n})/\sqrt {r}\) are i.i.d. with mean zero and unit variance, together with the Anscombe-Doeblin Lemma [36], we have that as b→∞, the second term on the right-hand side of (59) is equal to N/2+o(1). The third term can be shown to be small, similar to [15]. Finally, ignoring the overshoot of the detection statistic exceeding the detection threshold, we can replace the left-hand side of (59) with b. Solving the equation, we obtain that the first-order approximation of the EDD is given by (27). □
Notes
 1.
The video can be found at http://nislab.ee.duke.edu/MOUSSE/. The Solar Object Locator for the original data is SOL20110430T214549L061C108.
Abbreviations
 ARL:

Average run length
 CUSUM:

Cumulative sum
 EDD:

Expected detection delay
 EWMA:

Exponentially weighted moving average
 GLR:

Generalized likelihood ratio
 KL:

Kullback-Leibler
 PCA:

Principal component analysis
 SPC:

Statistical process control
 SVD:

Singular value decomposition
References
 1
Y. Xie, M. Wang, A. Thompson, in Global Conference on Signal and Information Processing (GlobalSIP). Sketching for sequential changepoint detection (Orlando, 2015), pp. 78–82.
 2
A. Tartakovsky, I. Nikiforov, M. Basseville, Sequential analysis: Hypothesis Testing and Changepoint Detection (Chapman and Hall/CRC, 2014).
 3
H. V. Poor, O. Hadjiliadis, Quickest detection (Cambridge University Press, Cambridge, 2008).
 4
D. P. Woodruff, Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci. 10, 1–157 (2014).
 5
G. Dasarathy, P. Shah, B. N. Bhaskar, R. Nowak, in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on. Covariance sketching (IEEE, Monticello, 2012), pp. 1026–1033.
 6
Y. Chi, in Signal Processing Conference (EUSIPCO), 2016 24th European. Kronecker covariance sketching for spatial-temporal data (IEEE, Budapest, 2016), pp. 316–320.
 7
Y. Wang, H. Y. Tung, A. J. Smola, A. Anandkumar, in Advances in Neural Information Processing Systems. Fast and guaranteed tensor decomposition via sketching (Montreal, 2015), pp. 991–999.
 8
A. Alaoui, M. W. Mahoney, in Advances in Neural Information Processing Systems. Fast randomized kernel ridge regression with statistical guarantees (Montreal, 2015), pp. 775–783.
 9
Y. Bachrach, R. Herbrich, E. Porat, in International Symposium on String Processing and Information Retrieval. Sketching algorithms for approximating rank correlations in collaborative filtering systems (Springer, New York, 2009), pp. 344–352.
 10
G. Raskutti, M. Mahoney, in International Conference on Machine Learning. Statistical and algorithmic perspectives on randomized sketching for ordinary least-squares (Lille, 2015), pp. 617–625.
 11
P. Indyk, in Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Explicit constructions for compressed sensing of sparse signals (Society for Industrial and Applied Mathematics, San Francisco, 2008), pp. 30–33.
 12
D. Siegmund, B. Yakir, N. R. Zhang, Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Stat. 5(2A), 645–668 (2011).
 13
G. K. Atia, Change detection with compressive measurements. IEEE Sig. Process. Lett. 22(2), 182–186 (2015).
 14
D. Siegmund, E. S. Venkatraman, Using the generalized likelihood ratio statistic for sequential detection of a changepoint. Ann. Stat. 23(1), 255–271 (1995).
 15
Y. Xie, D. Siegmund, Sequential multi-sensor changepoint detection. Ann. Stat. 41(2), 670–692 (2013).
 16
G. C. Runger, Projections and the U-squared multivariate control chart. J. Qual. Technol. 28(3), 313–319 (1996).
 17
O. Bodnar, W. Schmid, Multivariate control charts based on a projection approach. Allg. Stat. Arch. 89, 75–93 (2005).
 18
E. Skubalska-Rafajłowicz, Random projections and Hotelling’s T2 statistics for change detection in high-dimensional data analysis. Int. J. Appl. Math. Comput. Sci. 23(2), 447–461 (2013).
 19
E. Skubalska-Rafajłowicz, in Stochastic models, statistics and their applications, 122. Changepoint detection of the mean vector with fewer observations than the dimension using instantaneous normal random projections (Springer Proc. Math. Stat., New York, 2015).
 20
D. C. Montgomery, Introduction to statistical quality control (Wiley, 2008).
 21
M. A. Davenport, P. T. Boufounos, M. B. Wakin, R. G. Baraniuk, Signal processing with compressive measurements. IEEE J. Sel. Top. Sig. Process. 4(2), 445–460 (2010).
 22
E. Arias-Castro, et al., Detecting a vector based on linear measurements. Electron. J. Stat. 6, 547–558 (2012).
 23
J. Geng, W. Xu, L. Lai, in Information Theory (ISIT), 2013 IEEE International Symposium on. Quickest search over multiple sequences with mixed observations (Istanbul, 2013), pp. 2582–2586.
 24
W. Xu, L. Lai, in Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on. Compressed hypothesis testing: to mix or not to mix? (IEEE, Monticello, 2013), pp. 1443–1449.
 25
Z. Harchaoui, E. Moulines, F. R. Bach, in Advances in Neural Information Processing Systems. Kernel changepoint analysis (Vancouver, 2009), pp. 609–616.
 26
S. Arlot, A. Celisse, Z. Harchaoui, Kernel changepoint detection (2012). arXiv preprint arXiv:1202.3878.
 27
Y. C. Chen, T. Banerjee, A. D. Domínguez-García, V. V. Veeravalli, Quickest line outage detection and identification. IEEE Trans. Power Syst. 31(1), 749–758 (2016).
 28
D. Mishin, K. Brantner-Magee, F. Czako, A. S. Szalay, in High Performance Extreme Computing Conference (HPEC), 2014 IEEE. Real time change point detection by incremental PCA in large scale sensor data (IEEE, Waltham, 2014), pp. 1–6.
 29
F. Tsung, K. Wang, in Frontiers in Statistical Quality Control. Adaptive charting techniques: literature review and extensions (Springer-Verlag, New York, 2010), pp. 19–35.
 30
J. Chen, S. H. Kim, Y. Xie, S3T: an efficient score-statistic for spatiotemporal surveillance (2017). arXiv:1706.05331.
 31
W. Xu, B. Hassibi, in Info. Theory Workshop. Efficient compressive sensing with deterministic guarantees using expander graphs (2007).
 32
Y. Chen, C. Suh, A. J. Goldsmith, in Information Theory (ISIT), 2015 IEEE International Symposium on, Hong Kong. Information recovery from pairwise measurements: a Shannon-theoretic approach (IEEE, Hong Kong, 2015), pp. 2336–2340.
 33
A. K. Massimino, M. A. Davenport, in Proc. Workshop on Signal Processing with Adaptive Sparse Structured Representations (SPARS). One-bit matrix completion for pairwise comparison matrices (Lausanne, 2013).
 34
J. E. Jackson, A user’s guide to principal components (Wiley, New York, 1991).
 35
L. Balzano, S. J. Wright, Local convergence of an algorithm for subspace identification from partial data. Found. Comput. Math. 15(5), 1279–1314 (2015).
 36
D. Siegmund, Sequential Analysis: Tests and Confidence Intervals (Springer, New York, 1985).
 37
D. Siegmund, B. Yakir, The statistics of gene mapping (Springer, New York, 2007).
 38
H. Ruben, The volume of an isotropic random parallelotope. J. Appl. Probab. 16(1), 84–94 (1979).
 39
P. Frankl, H. Maehara, Some geometric applications of the beta distribution. Ann. Inst. Statist. Math. 42(3), 463–474 (1990).
 40
D. Burshtein, G. Miller, Expander graph arguments for message-passing algorithms. IEEE Trans. Inf. Theory. 47(2), 782–790 (2001).
 41
E. J. Candes, The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l’Academie des Sciences, Paris, Serie I. 342, 589–592 (2008).
 42
W. H. Woodall, M. M. Ncube, Multivariate CUSUM quality-control procedures. Technometrics. 27(3), 285–292 (1985).
 43
Y. Xie, J. Huang, R. Willett, Changepoint detection for high-dimensional time series with missing data. IEEE J. Sel. Top. Sig. Process. 7(1), 12–27 (2013).
 44
D. J. Watts, S. H. Strogatz, Collective dynamics of ‘small-world’ networks. Nature. 393(6684), 440–442 (1998).
 45
A. Abur, A. G. Exposito, Power system state estimation: Theory and Implementation (CRC Press, 2004).
 46
J. Fan, Y. Liao, H. Liu, An overview on the estimation of large covariance and precision matrices. Econ. J. 19(1), C1–C32 (2016).
 47
M. Kapralov, V. K. Potluru, D. P. Woodruff, in International Conference on Machine Learning. How to fake multiply by a Gaussian matrix (New York, 2016).
 48
M. Abramowitz, I. Stegun, A handbook of mathematical functions, with formulas, graphs and mathematical tables, 10th edn. (Dover, New York, 1964).
 49
A. Thompson, Quantitative analysis of algorithms for compressed signal recovery. PhD thesis, School of Mathematics, University of Edinburgh (2012).
 50
B. Laurent, P. Massart, Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28(5), 1302–1338 (2000).
Acknowledgements
We thank the anonymous reviewers for their excellent comments, which helped greatly improve the paper. We also thank Minghe Zhang for help with some of the revisions.
Funding
This work is partially supported by NSF grants CCF-1442635 and CMMI-1538746 and an NSF CAREER Award CCF-1650913.
Author information
Contributions
We present sequential changepoint detection procedures based on linear sketches of high-dimensional signal vectors using generalized likelihood ratio (GLR) statistics. Rigorous theoretical analysis and numerical examples on simulated and real data are also presented. YX came up with the original formulation and analysis. YC performed the numerical examples and additional theoretical analysis. AT and MW also contributed to the theoretical analysis. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Yao Xie.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This paper was presented in part at GlobalSIP 2015 [1].
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Changepoint detection
 Streaming data
 Sketching
 Anomaly detection
 Statistical signal processing