Group lassoing change-points in piecewise-constant AR processes
 Daniele Angelosante^{1} and
 Georgios B. Giannakis^{2}
https://doi.org/10.1186/1687-6180-2012-70
© Angelosante and Giannakis; licensee Springer. 2012
Received: 31 October 2011
Accepted: 21 March 2012
Published: 21 March 2012
Abstract
Regularizing the least-squares criterion with the total number of coefficient changes, it is possible to estimate time-varying (TV) autoregressive (AR) models with piecewise-constant coefficients. Such models emerge in various applications including speech segmentation, biomedical signal processing, and geophysics. To cope with the inherent lack of continuity and the high computational burden when dealing with high-dimensional data sets, this article introduces a convex regularization approach enabling efficient and continuous estimation of TVAR models. To this end, the problem is cast as a sparse regression one with grouped variables, and is solved by resorting to the group least-absolute shrinkage and selection operator (Lasso). The fresh look advocated here permeates benefits from advances in variable selection and compressive sampling to signal segmentation. An efficient block-coordinate descent algorithm is developed to implement the novel segmentation method. Issues regarding regularization and uniqueness of the solution are also discussed. Finally, an alternative segmentation technique is introduced to improve the detection of change instants. Numerical tests using synthetic and real data corroborate the merits of the developed segmentation techniques in identifying piecewise-constant TVAR models.
Keywords
1. Introduction
Autoregressive (AR) models have been the workhorse for parametric spectral estimation since they form a dense set in the class of continuous spectra and, in many cases, they approximate parsimoniously the spectrum of a given random process [[1], Chap. 3]. These are among the main reasons why AR models have been widely adopted in applications as diverse as speech modeling [2–5], electroencephalogram (EEG) signal analysis [6], and geophysics [7]. While AR modeling of stationary random processes is well appreciated, a number of signals encountered in real life are nonstationary. This justifies the growing interest toward nonstationary signal analysis and time-varying (TV) AR models, which arise naturally in speech analysis due to the changing shape of the vocal tract, as well as in EEG signal analysis due to the changes in the electrical activity of neurons. If the TVAR coefficient trajectories can be well approximated by superimposing a small number of basis sequences, nonstationary modeling reduces to estimating, via e.g., least-squares (LS), the basis expansion coefficients [7]. On the other hand, it has been well documented that piecewise-constant AR systems excited by white Gaussian noise are capable of modeling real-world signals such as speech and EEG [6–8]. Piecewise-constant AR models constitute a subset of TVAR models wherein the AR coefficients change abruptly. In this case, basis expansion techniques fall short in estimating the change points [8].
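As a point of reference for the stationary case, the AR coefficients of a single segment can be fit by ordinary LS on lagged regressors. The following is a minimal sketch; the function name and the AR(2) coefficients are illustrative choices, not taken from the article:

```python
import numpy as np

def fit_ar_ls(y, L):
    """Least-squares fit of a stationary AR(L) model y[n] = h[n]^T a + v[n],
    with regressor h[n] = [y[n-1], ..., y[n-L]]^T built from past samples."""
    H = np.column_stack([y[L - l:len(y) - l] for l in range(1, L + 1)])
    a, *_ = np.linalg.lstsq(H, y[L:], rcond=None)
    return a

# Example: a stable AR(2) process driven by white Gaussian noise.
rng = np.random.default_rng(0)
a_true = np.array([1.5, -0.9])
y = np.zeros(2000)
for n in range(2, len(y)):
    y[n] = a_true @ y[n - 2:n][::-1] + rng.standard_normal()
a_hat = fit_ar_ls(y, 2)   # close to a_true for long records
```

For a long record the LS estimate is close to the true coefficient vector, which is the baseline that the segmentation methods below generalize to the piecewise-constant case.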
Exploiting the piecewise constancy of TVAR models, several methods are available to detect the change instants of the AR coefficients, and thus facilitate what is often referred to as signal segmentation. The literature on signal segmentation is large, since the topic is of interest in signal processing, applied statistics, and several other branches of science and engineering. Recent advances can be mainly divided into two categories. The first class adopts regularized LS criteria in order to impose piecewise-constant AR coefficients. To avoid "oversegmentation," the LS cost is typically regularized with the total number of changes [6]. The resulting estimator can be implemented via dynamic programming (DP), which incurs computational burden that scales quadratically with the signal dimension. For large data sets, such as those considered in speech processing, this burden deters practitioners from applying DP to segmentation, and heuristics are pursued instead based on the generalized likelihood ratio test (GLRT), or approximations of the maximum likelihood approach [2], [[7], p. 401], [9]. The second class of methods relies on Bayesian inference and Markov chain Monte Carlo (MCMC) methods [3–5]. A distinct advantage of this class is that model order selection can be performed automatically, and a variable model order can be chosen per segment. However, Bayesian techniques are known to require large computational resources.
The algorithm for change detection of piecewise-constant AR models developed in this article belongs to the first class of methods, and its first novelty consists in developing a new regularization function which encourages piecewise-constant TVAR coefficients while being convex and continuous; hence, it can afford efficient convex optimization solvers. To this end, it is shown that the segmentation problem can be recast as a sparse regression problem. The regularization function in [6] is then relaxed with its tightest convex approximation. It turns out that the resultant change detector is a modification of the group least-absolute shrinkage and selection operator (Lasso) [10].
With the emphasis placed on large data sets, a candidate algorithm for implementing the developed change detector is a block-coordinate descent iteration, which is provably convergent to the group Lasso solution. Surprisingly, it turns out that each iteration of the block-coordinate descent can be implemented at complexity that scales linearly with the signal dimension, thus encouraging its application to large data sets. Regularization tuning and uniqueness of the group Lasso solution are also discussed.
The second novelty of the present study is an alternative change-point retrieval algorithm based on the smoothly clipped absolute deviation (SCAD) regularization. The associated nonconvex problem is tackled by resorting to a local linear approximation (LLA), which yields iterated weighted group Lasso minimization problems that can be solved via block-coordinate descent. Numerical tests using synthetic and real (speech and sound) data are performed to corroborate the capability of the developed algorithms to identify piecewise-constant TVAR models.
The remainder of the article is structured as follows. Section 2 deals with piecewise-constant TVAR model estimation preliminaries. In Section 3, the problem at hand is recast as a sparse linear regression, and the novel group Lasso approach is introduced. An efficient block-coordinate descent algorithm is developed in Section 4, while tuning issues and uniqueness of the group Lasso solution are addressed in Section 5. Section 6 introduces a nonconvex segmentation method based on the SCAD regularization to enhance the sparsity of the solution, which translates to retrieving the change instants more precisely. Numerical tests are presented in Section 7, and concluding remarks are summarized in Section 8. The Appendix is devoted to technical proofs. Notation: Column vectors (matrices) are denoted using lowercase (uppercase) boldface letters; calligraphic letters are reserved for sets; $(\cdot)^{T}$ stands for transposition; $\mathcal{N}(\mu,\sigma^{2})$ denotes the Gaussian probability density function with mean $\mu$ and variance $\sigma^{2}$; $\otimes$ denotes the Kronecker product; $\mathbf{0}_{L}$ is the $L$-dimensional column vector of all zeros, and $\mathbf{I}_{L}$ is the $L\times L$ identity matrix. The $\ell_{p}$ norm of $\mathbf{x}:={[x_{1},\dots,x_{L}]}^{T}\in\mathbb{R}^{L}$ is defined as $\|\mathbf{x}\|_{p}:={\left(\sum_{l=1}^{L}{|x_{l}|}^{p}\right)}^{1/p}$.
2. Preliminaries and problem statement
for k = 0,1,..., K, where K denotes the number of abrupt changes in the TVAR spectrum, and n_{ k } the time instant of the k th abrupt change. The interval [n_{ k } , n_{k+1} − 1] is referred to as the k th segment. Without loss of generality, n_{0} = 0 and n_{K+1} − 1 = N.
The goal is to estimate the instants ${\left\{{n}_{k}\right\}}_{k=1}^{K}$ where the given time series {y_{ n } } is split into K + 1 (stationary) segments, and also the constant AR coefficients per segment, i.e., ${\left\{{a}_{k}\right\}}_{k=0}^{K}$. The number of abrupt changes, namely K, is not necessarily known.
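To make the problem statement concrete, the following sketch generates a piecewise-constant TVAR process with a single abrupt change. The segment coefficients and the change instant are hypothetical values chosen only for illustration:

```python
import numpy as np

def simulate_pc_ar(coeffs, change_instants, N, sigma=1.0, seed=0):
    """Simulate y_0, ..., y_N from a piecewise-constant TVAR process: within
    segment k (n in [n_k, n_{k+1} - 1]) the AR coefficients equal coeffs[k]."""
    rng = np.random.default_rng(seed)
    L = len(coeffs[0])
    bounds = [0] + list(change_instants) + [N + 1]   # n_0 = 0, n_{K+1} - 1 = N
    y = np.zeros(N + 1)
    for k, a_k in enumerate(coeffs):
        for n in range(max(bounds[k], L), bounds[k + 1]):
            # regressor [y[n-1], ..., y[n-L]] times the segment's coefficients
            y[n] = np.dot(a_k, y[n - L:n][::-1]) + sigma * rng.standard_normal()
    return y

# Two stationary segments with one abrupt change at n_1 = 500 (hypothetical).
y = simulate_pc_ar([np.array([1.5, -0.9]), np.array([-0.5, -0.7])],
                   change_instants=[500], N=1000)
```

Estimating both the change instant (here 500) and the per-segment coefficients from such a record is exactly the task addressed in the sequel.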
2.1. Optimum segmentation of TVAR processes
From a practical point of view, the minimization in (4) or (6) is challenging since an exhaustive search over all possible sets of change instants has to be performed. However, several techniques based on DP, simulated annealing, and iterated conditional modes algorithms have been developed to evaluate (4) [6, 16]. Although DP approaches solve (4) in polynomial time, their computational complexity is quadratic in N, which limits their applicability to signal segmentation in practice. In typical applications, N can be very large (up to several thousands), and even quadratic complexity cannot be afforded. On the other hand, when applied to real data, the performance of the estimator in (4) is not satisfactory [17].
To overcome these limitations of (4), heuristic approaches based on the GLRT are used in real-world applications [[7], p. 401], [9, 18, 19]. However, GLRT-based change detectors are sensitive to modeling errors, and require fine tuning of the associated detection thresholds.
In what follows, a convex relaxation of the cost in (4) is advocated based on recent advances in sparse linear regression and compressive sampling. To this end, (4) is first reformulated into a sparse regression problem with a nonconvex regularization, which is successively relaxed through its tightest convex approximation. The consequent optimization rule yields sparse vector estimators which result in surprisingly accurate retrieval of change-points. These are obtained by an efficient block-coordinate descent iteration that incurs only linear computational burden and memory storage. Based on well-established results in statistics, it will be further argued that, unlike (4), the resultant TVAR model estimates are a continuous function of the data.
3. Sparse linear regression and group lassoing
What makes the formulation in (11) attractive but also challenging is the nonconvex and discontinuous Schwarz-like regularization term. The latter "pushes" most of the ${\left\{{\mathsf{\text{d}}}_{n}\right\}}_{n=1}^{N}$ vectors toward 0_{ L }, while d_{0} is not penalized. As a consequence, the vector $\widehat{\mathsf{\text{d}}}:={\left[{\widehat{\mathsf{\text{d}}}}_{0}^{T},{\widehat{\mathsf{\text{d}}}}_{1}^{T},\dots ,{\widehat{\mathsf{\text{d}}}}_{N}^{T}\right]}^{T}$ is group sparse, and the nonzero group indexes correspond to the change instants of the TVAR coefficients. Recently, a convex model selector with grouped variables was put forth in [10], and successfully applied to biostatistics and compressive sampling [20]. It generalizes the (non-grouped) least-absolute shrinkage and selection operator (Lasso) [21] to regression problems where the unknown vector exhibits sparsity in groups; hence, its name group Lasso. The crux of the group Lasso is to relax the Schwarz-like regularization in (11) with its tightest convex approximation.
where λ is a positive tuning parameter. It is known that the group Lasso regularization encourages group sparsity; that is, ${\widehat{\mathsf{\text{d}}}}_{n}={0}_{L}$ for most n > 0 [10]. Again, the larger the λ, the sparser the $\widehat{\mathsf{\text{d}}}$.
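As a small numeric illustration (this helper is not from the article), the group Lasso regularizer sums the Euclidean norms of the groups, which is what drives entire groups, rather than individual entries, to zero:

```python
import numpy as np

def group_lasso_penalty(d_groups, lam):
    """Group-Lasso regularizer: lam * sum of the Euclidean norms of the groups.
    Contrast with the plain Lasso, which sums absolute values entry by entry."""
    return lam * sum(np.linalg.norm(d) for d in d_groups)

# Penalty for three 2-dimensional groups, one of which is entirely zero.
groups = [np.array([3.0, 4.0]), np.array([0.0, 0.0]), np.array([1.0, 0.0])]
penalty = group_lasso_penalty(groups, lam=2.0)   # 2 * (5 + 0 + 1) = 12
```

A zero group contributes nothing, while a group with even one nonzero entry contributes its full norm, so the minimizer prefers to switch whole groups off, matching the change-instant interpretation above.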
Remark 1. In contrast to the Schwarz-like regularization, the group Lasso one grows unbounded, as Figure 1 illustrates. This makes the resultant estimator biased. Nevertheless, unlike the Schwarz-like regularization, the group Lasso one is continuous, which renders the resulting estimator more stable when applied to real data; see also [10, 22]. A continuous regularization function that reduces the bias of the group Lasso will be discussed in Section 6.
Remark 2. Convex relaxation for detecting changes in the mean of nonstationary processes was recently mentioned in [12], and analyzed in [17]. For the mean-change problem, the tightest convex approximation of the Schwarz-like regularized LS is provided by the Lasso, which can afford efficient solvers such as the least-angle regression (LARS) algorithm [23]. However, for the group Lasso cost proposed here to catch changes in TVAR models, an exact LARS-like solver is not available [10]; thus, the pursuit of efficient algorithms for solving (12) is well motivated. This is the theme of the ensuing section.
4. Block-coordinate descent solver
Notice that if ${\|{\mathsf{\text{g}}}_{n}^{\left(i\right)}\|}_{2}\le \lambda $, the solution of (18) is ${\mathsf{\text{d}}}_{n}^{\left(i\right)}={0}_{L}$. Since the solution of (12) is expected to be sparse, solving (18) is trivial most of the time. If ${\|{\mathsf{\text{g}}}_{n}^{\left(i\right)}\|}_{2}>\lambda $, ${\mathsf{\text{d}}}_{n}^{\left(i\right)}$ can be obtained via interior point methods or by (numerically) solving the scalar equation in (21), which admits fast solvers via, e.g., Newton-Raphson iterations, as in [25].
The block-coordinate descent algorithm is summarized in Algorithm 1. Interestingly, the matrix $\mathsf{\text{X}}\in {\mathbb{R}}^{\left(N+1\right)\times \left(N+1\right)L}$ in (12) does not have to be stored, since ${\left\{{\mathsf{\text{R}}}_{n:N}\right\}}_{n=0}^{N}$ and ${\left\{{\mathsf{\text{r}}}_{n:N}\right\}}_{n=0}^{N}$ suffice to implement Algorithm 1. Thus, the memory storage and complexity to perform one block-coordinate descent iteration grow linearly with N. This attribute renders the block-coordinate descent appealing especially for large-size problems where DP approaches tend to be too expensive.
Regarding convergence, the ensuing assertion is a direct consequence of the results in [26].
Proposition 1. The iterates ${\mathsf{\text{d}}}^{\left(i\right)}:={\left[{\mathsf{\text{d}}}_{0}^{{\left(i\right)}^{T}},{\mathsf{\text{d}}}_{1}^{{\left(i\right)}^{T}},\dots ,{\mathsf{\text{d}}}_{N}^{{\left(i\right)}^{T}}\right]}^{T}$ obtained by Algorithm 1 converge to the global minimum of (12); that is, $\underset{i\to \infty}{\mathsf{\text{lim}}}{\mathsf{\text{d}}}^{\left(i\right)}=\widehat{\mathsf{\text{d}}}$.
Block-coordinate descent will also be the basic building block for solving the nonconvex problem introduced in Section 6 to improve the retrieval of change-points. But first, it is useful to consider two issues pertaining to the group Lasso change detector for TVAR models.
Given ${\left\{{\mathsf{\text{R}}}_{n:N},{\mathsf{\text{r}}}_{n:N}\right\}}_{n=0}^{N}$
Initialize with ${\mathsf{\text{d}}}_{n}^{\left(0\right)}={0}_{L}$ for n = 1, ... , N
for i > 0 do
for n = 0,1,..., N do
if n = 0 then
${\mathsf{\text{c}}}_{0}^{\left(i\right)}={0}_{L}$
${\mathsf{\text{s}}}_{0}^{\left(i\right)}={\sum}_{n=1}^{N}{\mathsf{\text{R}}}_{n:N}{\mathsf{\text{d}}}_{n}^{\left(i-1\right)}$
${\mathsf{\text{g}}}_{0}^{\left(i\right)}={\mathsf{\text{s}}}_{0}^{\left(i\right)}-{\mathsf{\text{r}}}_{0:N}$
${\mathsf{\text{d}}}_{0}^{\left(i\right)}=-{\mathsf{\text{R}}}_{0:N}^{-1}{\mathsf{\text{g}}}_{0}^{\left(i\right)}$
else
${\mathsf{\text{c}}}_{n}^{\left(i\right)}={\mathsf{\text{c}}}_{n-1}^{\left(i\right)}+{\mathsf{\text{d}}}_{n-1}^{\left(i\right)}$
${\mathsf{\text{s}}}_{n}^{\left(i\right)}={\mathsf{\text{s}}}_{n-1}^{\left(i\right)}-{\mathsf{\text{R}}}_{n:N}{\mathsf{\text{d}}}_{n}^{\left(i-1\right)}$
${\mathsf{\text{g}}}_{n}^{\left(i\right)}={\mathsf{\text{R}}}_{n:N}{\mathsf{\text{c}}}_{n}^{\left(i\right)}+{\mathsf{\text{s}}}_{n}^{\left(i\right)}-{\mathsf{\text{r}}}_{n:N}$
if ${\|{\mathsf{\text{g}}}_{n}^{\left(i\right)}\|}_{2}\le \lambda $ then
${\mathsf{\text{d}}}_{n}^{\left(i\right)}={0}_{L}$
else
${\mathsf{\text{d}}}_{n}^{\left(i\right)}=\arg \min_{{\mathsf{\text{d}}}_{n}\in {\mathbb{R}}^{L}}\left[\frac{1}{2}{\mathsf{\text{d}}}_{n}^{T}{\mathsf{\text{R}}}_{n:N}{\mathsf{\text{d}}}_{n}+{\mathsf{\text{d}}}_{n}^{T}{\mathsf{\text{g}}}_{n}^{\left(i\right)}+\lambda {\|{\mathsf{\text{d}}}_{n}\|}_{2}\right]$
Algorithm 1: Block-coordinate descent algorithm
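For intuition, the block-coordinate logic can be sketched on a simplified problem in which every block of the regression matrix has orthonormal columns, so that each per-group minimization reduces to a closed-form block soft-threshold. This is only a toy surrogate for Algorithm 1, which handles the general correlated blocks via the recursions above and leaves the first block unpenalized; here, all blocks are penalized:

```python
import numpy as np

def bcd_group_lasso(X_blocks, y, lam, iters=200):
    """Block-coordinate descent for
        min_d 0.5 * ||y - sum_n X_n d_n||^2 + lam * sum_n ||d_n||_2,
    assuming each X_n has orthonormal columns, so the update per group is the
    block soft-threshold d_n = (1 - lam/||g_n||)_+ * g_n with
    g_n = X_n^T (residual excluding block n)."""
    d = [np.zeros(X.shape[1]) for X in X_blocks]
    for _ in range(iters):
        for n, Xn in enumerate(X_blocks):
            # residual with block n's current contribution added back
            r = y - sum(X @ dm for X, dm in zip(X_blocks, d)) + Xn @ d[n]
            g = Xn.T @ r
            norm_g = np.linalg.norm(g)
            d[n] = np.zeros_like(d[n]) if norm_g <= lam else (1 - lam / norm_g) * g
    return d

# Toy example: two orthonormal blocks taken from the 4x4 identity (hypothetical).
I4 = np.eye(4)
d = bcd_group_lasso([I4[:, :2], I4[:, 2:]],
                    y=np.array([2.0, 0.0, 0.1, 0.0]), lam=1.0)
```

In the toy run the second block's correlation with the data (0.1) falls below lam, so that whole group is zeroed, mirroring how the zero-test on $\|\mathsf{\text{g}}_{n}^{(i)}\|_2$ in Algorithm 1 discards non-change instants.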
5. Regularization and uniqueness issues
Performance of model selection with grouped variables via group Lasso and related approaches has been analyzed in [10, 20, 27], while asymptotic analysis has been pursued in [28, 29]. In this section, a couple of issues are investigated regarding the group Lasso cost function, and the uniqueness of its minimum.
5.1. Tuning the regularization parameter
Selection of λ is a critical issue since larger λ's promote sparser solutions, which translate to fewer changes in the TVAR spectrum. However, larger λ's increase the estimator bias as well. If the number of changes present is known a priori by other means, or if a certain level of segmentation can be afforded, λ can be tuned accordingly by 'trial and error,' or by cross-validation. But in general, analytic methods to automatically choose the "best" value of λ are not available. In essence, selecting the regularization parameter is more a matter of engineering art than systematic science.
In this section, heuristic but useful guidelines are provided to choose λ based on rigorously established lower bounds on this parameter. Given $\mathsf{\text{X}}\in {\mathbb{R}}^{\left(N+1\right)\times \left(N+1\right)L}$ in (10), define ${\mathsf{\text{X}}}_{n}\in {\mathbb{R}}^{\left(N+1\right)\times L}$, n = 0,1,..., N, such that X = [X_{0}, X_{1},..., X_{ N }]. To bound λ, we will rely on the following result; see Appendix 1 for the proof.
Proposition 2. If $\mathbf{X}_0$ has full column rank, then $\widehat{\mathbf{d}}={\left[\mathbf{d}_{0,c}^{T},\mathbf{0}_{L}^{T},\dots ,\mathbf{0}_{L}^{T}\right]}^{T}$ with $\mathbf{d}_{0,c}:={\left(\mathbf{X}_{0}^{T}\mathbf{X}_{0}\right)}^{-1}\mathbf{X}_{0}^{T}\mathbf{y}$, if and only if $\lambda \ge {\lambda}^{*}:=\max_{n=1,\dots ,N}{\|\mathbf{X}_{n}^{T}\left(\mathbf{X}_{0}\mathbf{d}_{0,c}-\mathbf{y}\right)\|}_{2}$.
If λ exceeds a threshold, which is specified by the regression matrix and the observations, Proposition 2 asserts that ${\widehat{\mathbf{d}}}_{0}=\mathbf{d}_{0,c}$ and ${\widehat{\mathbf{d}}}_{n}={\mathbf{0}}_{L}$ for n = 1,..., N. This, along with (9), implies that ${\widehat{\mathbf{a}}}_{n}=\mathbf{d}_{0,c}$ for n = 0, 1,..., N; that is, no change occurs in the coefficients of the TVAR process. To avoid this trivial (change-free) solution, the guideline provided by Proposition 2 is that λ must be chosen strictly less than λ*. Our extensive simulations suggest that setting λ equal to a small percentage of λ*, say 5-20%, results in satisfactory estimates.
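The bound of Proposition 2 is straightforward to compute; a sketch, assuming the blocks X_n are available as arrays, is:

```python
import numpy as np

def lambda_star(X_blocks, y):
    """Lower bound lambda* from Proposition 2: for lambda >= lambda* the
    group-Lasso estimate is change-free (d_n = 0 for all n >= 1).
    X_blocks = [X_0, X_1, ..., X_N] are the blocks of the regression matrix."""
    X0 = X_blocks[0]
    d0c, *_ = np.linalg.lstsq(X0, y, rcond=None)    # LS fit using X_0 alone
    resid = X0 @ d0c - y
    return max(np.linalg.norm(Xn.T @ resid) for Xn in X_blocks[1:])

# Toy example (hypothetical blocks): single columns of the 2x2 identity.
X0 = np.array([[1.0], [0.0]])
X1 = np.array([[0.0], [1.0]])
lam_star = lambda_star([X0, X1], np.array([1.0, 2.0]))
# Following the 5-20% guideline, one would then set, e.g., lam = 0.1 * lam_star.
```
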
5.2. Uniqueness of the sparse solution
Uniqueness of sparse linear regression with non-grouped variables has been studied in [30–32]. Next, uniqueness issues in recovering sparse vectors with grouped variables are explored by exploiting the deterministic structure of the regression matrix in (12). The cost function in (12) is not strictly convex, since X is a fat matrix and the regularization term is not strictly convex; see also Figure 1. On the other hand, the block-coordinate descent algorithm developed in Section 4 is guaranteed to converge to a global minimum. In the following, a criterion is introduced to check a posteriori whether the obtained solution is unique for a given group-sparsity level.
Traditionally, the support of a sparse vector is defined as the set of indexes corresponding to its nonzero entries. In the group-sparsity framework herein, a different definition of support is required. Indeed, the vector of interest here, namely $\mathbf{d}={\left[\mathbf{d}_{0}^{T},\mathbf{d}_{1}^{T},\dots ,\mathbf{d}_{N}^{T}\right]}^{T}$, comprises N + 1 groups of L-dimensional variables. Since the term $\mathbf{d}_{0}$ is not penalized in (12), ${\widehat{\mathbf{d}}}_{0}\ne {\mathbf{0}}_{L}$ almost surely. Define the group support (g-support) of $\widehat{\mathbf{d}}$ to be the set containing the indexes of the nonzero groups of variables; that is, $\mathsf{\text{gsupp}}\left(\widehat{\mathbf{d}}\right):=\left\{n\in \left\{1,\dots ,N\right\}:{\widehat{\mathbf{d}}}_{n}\ne {\mathbf{0}}_{L}\right\}$. In the following, when $\mathcal{G}:=\left\{{s}_{1},\dots ,{s}_{|\mathcal{G}|}\right\}\subset \left\{1,\dots ,N\right\}$ denotes the g-support of $\mathbf{d}$, the set is assumed ordered; i.e., s_{ j } < s_{ k } for each j < k.
The following lemma establishes a property of the matrix X in (10); see Appendix 2 for the proof.
Lemma 1. If any L out of N + 1 vectors $\{\mathbf{h}_{n}\}_{n=0}^{N}$ are linearly independent, then for any g-support $\mathcal{G}=\{s_{1},\dots ,s_{|\mathcal{G}|}\}$ such that $(|\mathcal{G}|+1)L\le N+1$, $s_{1}\ge L$, $s_{|\mathcal{G}|}\le N-L+1$, and $|s_{j}-s_{k}|\ge L$ for each $j\ne k$, the matrix $\mathbf{X}_{\mathcal{G}}:=\left[\mathbf{X}_{0},\mathbf{X}_{s_{1}},\dots ,\mathbf{X}_{s_{|\mathcal{G}|}}\right]\in {\mathbb{R}}^{(N+1)\times (|\mathcal{G}|+1)L}$ has full column rank.
Lemma 1 asserts that the submatrix ${\mathbf{X}}_{\mathcal{G}}$ has full column rank if it is formed by the columns of X corresponding to the nonzero indexes of any sparse vector whose g-support is sufficiently small and whose nonzero groups are sufficiently distant from each other.
Next, Lemma 1 is exploited to establish an interesting property for the solutions of (12); see Appendix 3 for the proof.
Proposition 3. If any L out of N + 1 vectors $\{\mathbf{h}_{n}\}_{n=0}^{N}$ are linearly independent, then for any g-support $\mathcal{G}=\{s_{1},\dots ,s_{|\mathcal{G}|}\}\subset \{1,\dots ,N\}$ such that $(|\mathcal{G}|+1)L\le N+1$, $s_{1}\ge L$, $s_{|\mathcal{G}|}\le N-L+1$, and $|s_{j}-s_{k}|\ge L$ for each $j\ne k$, there exists at most one solution of (12) g-supported in $\mathcal{G}$.
Proposition 3 ensures that if $\widehat{\mathbf{d}}$ is g-supported in $\mathcal{G}$, and is sufficiently sparse with nonzero groups sufficiently far apart, then $\widehat{\mathbf{d}}$ is the only solution of (12) g-supported in $\mathcal{G}$.
Remark 3. Analysis of the group Lasso and its modifications has revealed that its performance can be close to that of the Schwarz-regularized LS either when the regression matrix is sufficiently block incoherent, or when the block restricted isometry property holds [10, 27]. If the regression matrix can be chosen by the designer and is randomly drawn from selected distributions (e.g., Gaussian or Bernoulli), these analyses provide useful connections between problems (11) and (12). In the problem at hand, however, the regression matrix is fixed, and its blocks [X_{0}, X_{1},..., X_{ N }] are highly correlated. In this case, the relationship between the solutions of (11) and (12) is much less understood, and constitutes an interesting future research direction.
6. Continuity, bias and the group SCAD
As already pointed out, convex relaxation of the Schwarz-like regularization was developed in [12, 17] for the mean-change problem using the (non-grouped) Lasso. Numerical results in [17] reveal that the Lasso tends to detect a "cloud" of small change points around an actual change. Post-processing via DP to select a few of the estimated change instants was proposed in [17]. Moreover, due to the bias introduced by the Lasso, once the change points are obtained, another step is required to re-estimate the mean within each segment. In the following, a novel change detector is developed based on recent advances in model selection via nonconvex regularization. The resulting estimator reduces the bias of the group Lasso and can afford a convergent optimization solver. The corresponding algorithm is based on iterative instantiations of the weighted group Lasso, which is capable of enhancing the sparsity of the solution [33, 34], and thus improving the precision of the detected change points.
Attributes of a "good" regularization function are delineated in [22], and three properties are identified to this end:
- Unbiasedness. The estimator has to be unbiased when the true unknown parameter has large amplitude.
- Sparsity. The estimator has to set small-amplitude coefficients to zero to reduce the number of variables.
- Continuity. The estimator has to be continuous in the data to avoid instability when estimating (non)zero variables.
The data dependence of ${\widehat{d}}^{\text{SCAD}}$ is depicted in Figure 2c for λ = 2 and a = 3.7. Observe that the SCAD enjoys the three aforementioned attributes of a desirable regularization function.
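In an LLA iteration, each group norm is weighted by the derivative of the SCAD penalty evaluated at the previous estimate: large groups receive zero weight (hence no bias), while small groups are penalized exactly as in the group Lasso. A sketch of this derivative, using the standard Fan-Li form with a = 3.7, is:

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """Derivative p'_lam(t) of the SCAD penalty (Fan-Li form): equals lam for
    |t| <= lam, decays linearly on (lam, a*lam), and vanishes for |t| >= a*lam,
    so large coefficients incur no shrinkage."""
    t = np.abs(t)
    return np.where(t <= lam, lam,
                    np.where(t < a * lam, (a * lam - t) / (a - 1.0), 0.0))

# Weights for one LLA-reweighted group-Lasso step, evaluated at the previous
# group norms (illustrative numbers):
w = scad_derivative(np.array([0.5, 4.0, 10.0]), lam=2.0)
```

Feeding these weights into the group Lasso solver of Section 4 yields one LLA iteration; groups with zero weight are left unshrunk, which is how the SCAD regularization mitigates the group Lasso bias.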
Remark 4. In principle, one could also apply a block-coordinate descent iteration to minimize (34) directly. The resulting iterates converge to a local minimum of (34) that depends on the starting points; see also [26]. In general, it is impossible to assess properties of this solution. Instead, the solution of the LLA in (37) has provable merits in estimating the true support of sparse signals [34].
Remark 5. Recently, greedy algorithms such as the matching pursuit and the orthogonal matching pursuit have been shown to approach the performance of (11) and (12) when the regression matrix is sufficiently block incoherent, or, when the block restricted isometry property holds [27]. In the problem at hand, wherein the regression matrix exhibits correlation among blocks, simulated tests have shown that greedy algorithms suffer from severe error propagation. In fact, a cloud of change points is typically declared around a true change point. For these reasons, greedy algorithms will not be considered hereafter.
The cost function in (38) is convex and effects a piecewise-constant and sparse TVAR model. It differs from model order selection criteria in that the selected nonzero AR coefficients do not necessarily have to be consecutive. A challenge associated with the optimization in (38) is that block-coordinate descent algorithms do not converge, since the differentiable part is not separable group-wise [26]. The cost function in (38) resembles the fused Lasso developed in [35]. Efficient algorithms exploiting this link and the structure of the problem at hand are currently under investigation.
7. Simulated tests
The merits of the novel approaches to catching change-points in TVAR processes are assessed via numerical simulations using synthetic and real data.
7.1. Synthetic data
7.2. Real data: piano sound
7.3. Real data: speech
Change instants detected by the group SCAD and GLR algorithms

Group SCAD: 799, 1217, 1570, 1742, 2017, 2359, 2814, 3668
GLR: 445, 645, 1550, 1750, 2151, 2797, 3400, 3626
The first change detected by the GLR is at sample 445, while this change is not detected by the group SCAD. Interestingly, it is reported in [3] that this change is not relevant for segmentation purposes, a fact that is apparent by inspection of the true signal. Both algorithms detected changes around samples 1750, 2100, 2800, and 3650. The group SCAD successfully removed the false change detected by the GLR at sample 3400; by inspecting the original signal, this change does not seem to be relevant. Unlike the GLR, the group SCAD detected a change instant around sample 1300, which is also detected by the advanced Bayesian techniques reported in [3–5]. The group SCAD detected a change at sample 799, while the GLR at sample 645; here, inspection of the original signal suggests that the GLR detection is preferable. Interestingly, the group SCAD, unlike the GLR and the Bayesian techniques of [3–5], detected a change at sample 2359. Observing the original signal around this point, there is a clear amplitude modulation that may cause a change in the TVAR coefficients, which existing algorithms have left undetected.
An objective way to compare the two algorithms is via the segmented prediction error (SPE). Assuming that K changes have been detected at instants ${\{{\widehat{n}}_{k}\}}_{k=1}^{K}$, let ${\widehat{\mathbf{a}}}_{k}$ denote the LS estimate of the AR model in the k th segment, i.e., ${\widehat{\mathbf{a}}}_{k}=\arg \min_{\mathbf{a}\in {\mathbb{R}}^{L}}{\sum}_{n={\widehat{n}}_{k}}^{{\widehat{n}}_{k+1}-1}{\left({y}_{n}-\mathbf{h}_{n}^{T}\mathbf{a}\right)}^{2}$. The SPE is defined as $\mathrm{SPE}:={\sum}_{k=0}^{K}{\sum}_{n={\widehat{n}}_{k}}^{{\widehat{n}}_{k+1}-1}{\left({y}_{n}-\mathbf{h}_{n}^{T}{\widehat{\mathbf{a}}}_{k}\right)}^{2}$, and represents the error in approximating the original signal {y_{ n } } with a TVAR model exhibiting abrupt changes at instants ${\{{\widehat{n}}_{k}\}}_{k=1}^{K}$. The GLR segmentation entails SPE_{glr} = 0.2638, while SPE_{g}_{scad} = 0.2578 for the group SCAD. Hence, the group SCAD based segmentation seems preferable to that of the GLR algorithm.
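The SPE computation described above can be sketched as follows (a hypothetical helper, not the authors' code): refit an AR(L) model by LS within each detected segment and accumulate the squared prediction residuals.

```python
import numpy as np

def segmented_prediction_error(y, change_instants, L):
    """Segmented prediction error: within each detected segment
    [n_k, n_{k+1} - 1], fit an AR(L) model by LS and sum the squared
    one-step prediction residuals over all segments."""
    bounds = [L] + sorted(change_instants) + [len(y)]
    spe = 0.0
    for k in range(len(bounds) - 1):
        idx = np.arange(bounds[k], bounds[k + 1])
        H = np.column_stack([y[idx - l] for l in range(1, L + 1)])  # lagged regressors
        a_k, *_ = np.linalg.lstsq(H, y[idx], rcond=None)
        spe += np.sum((y[idx] - H @ a_k) ** 2)
    return spe

# Sanity check: a noiseless sinusoid obeys an exact AR(2) recursion,
# so its SPE is essentially zero for any segmentation.
y = np.sin(0.5 * np.arange(200))
spe = segmented_prediction_error(y, [100], L=2)
```

A lower SPE for the same number of detected changes indicates a segmentation that better explains the data, which is the comparison criterion applied to the GLR and group SCAD results above.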
8. Concluding remarks
Novel estimators were developed in this article for the identification of piecewise-constant TVAR models by exploiting recent advances in variable selection and compressive sampling. While traditional techniques regularize an LS criterion with the total number of coefficient changes, the novel estimator relies on a convex regularization function, which resembles the group Lasso and can afford efficient implementation using block-coordinate descent iterations. The latter incur computational burden that scales linearly with the number of data samples, and are thus particularly attractive for large-size problems. Regularization tuning issues were discussed along with conditions for uniqueness of the estimated piecewise-constant AR model. An alternative group smoothly clipped absolute deviation regularization was also introduced, and an algorithm based on iterative weighted group Lasso minimizations was developed. Numerical tests using synthetic and real data confirm that the developed algorithms can effectively identify piecewise-constant AR models of large size at manageable complexity, and outperform heuristic alternatives based on the GLRT.
Appendix 1: proof of proposition 2
 (c1)
$\mathbf{w}_{0}={\mathbf{0}}_{L}$; and,
 (c2)
$\left\{\begin{array}{ll}\mathbf{w}_{n}+\lambda \frac{{\widehat{\mathbf{d}}}_{n}}{{\|{\widehat{\mathbf{d}}}_{n}\|}_{2}}={\mathbf{0}}_{L}, & \text{if }{\widehat{\mathbf{d}}}_{n}\ne {\mathbf{0}}_{L}\\ {\|\mathbf{w}_{n}\|}_{2}\le \lambda, & \text{if }{\widehat{\mathbf{d}}}_{n}={\mathbf{0}}_{L}\end{array}\right.\quad \text{for }n=1,\dots ,N.$
The change-free solution corresponds to having ${\widehat{\mathbf{d}}}_{n}={\mathbf{0}}_{L}$ for n = 1,..., N. Thus, (c1) implies that $\mathbf{X}_{0}^{T}\left(\mathbf{X}_{0}{\widehat{\mathbf{d}}}_{0}-\mathbf{y}\right)={\mathbf{0}}_{L}$, which is uniquely satisfied by ${\widehat{\mathbf{d}}}_{0}=\mathbf{d}_{0,c}$, since $\mathbf{X}_{0}$ has full column rank. Hence, ${\widehat{\mathbf{d}}}_{0}=\mathbf{d}_{0,c}$ and ${\widehat{\mathbf{d}}}_{n}={\mathbf{0}}_{L}$ for n = 1,..., N hold if and only if (c2) is satisfied, which corresponds to ${\|\mathbf{w}_{n}\|}_{2}\le \lambda$ for n = 1,..., N. Since $\mathbf{w}_{n}=\mathbf{X}_{n}^{T}\left(\mathbf{X}_{0}\mathbf{d}_{0,c}-\mathbf{y}\right)$, condition (c2) is satisfied if and only if $\lambda \ge {\lambda}^{*}:=\max_{n=1,\dots ,N}{\|\mathbf{X}_{n}^{T}\left(\mathbf{X}_{0}\mathbf{d}_{0,c}-\mathbf{y}\right)\|}_{2}$.
Appendix 2: proof of lemma 1
Notice that the first subsum in (42) comprises $N-s_{|\mathcal{G}|}+1$ rank-1 matrices, the last subsum comprises $s_{1}$ rank-1 matrices, while the g th subsum comprises $s_{|\mathcal{G}|-g+2}-s_{|\mathcal{G}|-g+1}$ rank-1 matrices for $g=2,\dots ,|\mathcal{G}|$. Since $\mathcal{G}$ is such that $s_{1}\ge L$, $s_{|\mathcal{G}|}\le N-L+1$, and $|s_{j}-s_{k}|\ge L$ for each $j\ne k$, and any L out of N + 1 vectors $\{\mathbf{h}_{n}\}_{n=0}^{N}$ are linearly independent, each of the summands in (42) has rank L. Thus, it is possible to find L linearly independent vectors $\{\mathbf{h}_{1,\ell}\}_{\ell =1}^{L}\subset {\mathbb{R}}^{L}$ such that the first subsum in (42) equals ${\sum}_{\ell =1}^{L}{\tilde{\mathbf{h}}}_{1,\ell}{\tilde{\mathbf{h}}}_{1,\ell}^{T}$ with ${\tilde{\mathbf{h}}}_{1,\ell}:={[\mathbf{h}_{1,\ell}^{T},\dots ,\mathbf{h}_{1,\ell}^{T}]}^{T}\in {\mathbb{R}}^{(|\mathcal{G}|+1)L}$. Analogously, it is possible to find L linearly independent vectors $\{\mathbf{h}_{g,\ell}\}_{\ell =1}^{L}\subset {\mathbb{R}}^{L}$ such that the g th subsum in (42) can be written as ${\sum}_{\ell =1}^{L}{\tilde{\mathbf{h}}}_{g,\ell}{\tilde{\mathbf{h}}}_{g,\ell}^{T}$ with ${\tilde{\mathbf{h}}}_{g,\ell}:={\left[\underbrace{\mathbf{h}_{g,\ell}^{T},\dots ,\mathbf{h}_{g,\ell}^{T}}_{|\mathcal{G}|-g+2},\underbrace{{\mathbf{0}}_{L}^{T},\dots ,{\mathbf{0}}_{L}^{T}}_{g-1}\right]}^{T}\in {\mathbb{R}}^{(|\mathcal{G}|+1)L}$ for $g=2,\dots ,|\mathcal{G}|$.
Finally, it is possible to find L linearly independent vectors ${\left\{{\text{h}}_{\left\mathcal{G}\right+1,\ell}\right\}}_{\ell =1}^{L}\subset {\mathbb{R}}^{L}$ such that the last subsum in (42) can be written as ${\sum}_{\ell =1}^{L}{\stackrel{\u0303}{\text{h}}}_{\left\mathcal{G}\right+1,\ell}{\stackrel{\u0303}{\text{h}}}_{\left\mathcal{G}\right+1,\ell}^{T}$ with ${\stackrel{\u0303}{\text{h}}}_{\left\mathcal{G}\right+1,\ell}:={\left[{\text{h}}_{\left\mathcal{G}\right+1,\ell}^{T},{0}_{L}^{T},\dots ,{0}_{L}^{T}\right]}^{T}\in {\mathbb{R}}^{\left(\left\mathcal{G}\right+1\right)L}$. Thus, ${\text{X}}_{\mathcal{G}}^{\mathcal{T}}{\text{X}}_{\mathcal{G}}={\sum}_{g=1}^{\left\mathcal{G}\right+1}{\sum}_{\ell =1}^{L}{\stackrel{\u0303}{\text{h}}}_{g,\ell}{\stackrel{\u0303}{\text{h}}}_{g,\ell}^{T}$, and since $\left\{{\stackrel{\u0303}{\text{h}}}_{g,\ell}\right\}$ are (  + 1) linearly independent vectors, ${\text{X}}_{\mathcal{G}}$ has fullcolumn rank.
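The stacking construction in the proof can be checked numerically: repeating each family of $L$ linearly independent vectors over the appropriate number of blocks and zero-padding the rest yields $(|\mathcal{G}|+1)L$ linearly independent stacked vectors, so their Gram sum has full rank. A minimal sketch with hypothetical dimensions ($L$ and $|\mathcal{G}|$ chosen small; random Gaussian vectors are linearly independent almost surely):

```python
import numpy as np

rng = np.random.default_rng(1)
L, G = 3, 2                      # AR order L and |G| (hypothetical sizes)

# For each g = 1, ..., G+1, pick L linearly independent vectors h_{g,l} in R^L
H = [rng.standard_normal((L, L)) for _ in range(G + 1)]

# Build the stacked vectors h~_{g,l} in R^{(G+1)L}: the g-th family repeats
# h_{g,l} in the first G - g + 2 blocks and is zero in the remaining g - 1
cols = []
for g in range(1, G + 2):
    reps = G - g + 2             # number of repeated blocks for this family
    for l in range(L):
        h = H[g - 1][:, l]
        cols.append(np.concatenate([np.tile(h, reps),
                                    np.zeros((G + 1 - reps) * L)]))

A = np.column_stack(cols)        # (G+1)L x (G+1)L matrix of stacked vectors
# The Gram sum  sum_{g,l} h~ h~^T  equals A A^T and has full rank (G+1)L
print(np.linalg.matrix_rank(A @ A.T))
```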
Appendix 3: Proof of Proposition 3
From Lemma 1, $\mathbf{X}_{\mathcal{G}}$ has full column rank, which implies that $\frac{1}{2}\|\mathbf{y} - \mathbf{X}_{\mathcal{G}}\mathbf{u}\|_2^2$ is strictly convex, and so is $\frac{1}{2}\|\mathbf{y} - \mathbf{X}_{\mathcal{G}}\mathbf{u}\|_2^2 + \lambda\sum_{s=1}^{|\mathcal{G}|}\|\mathbf{u}_s\|_2$. Thus, (43) admits a unique solution.
Since both $\hat{\mathbf{d}}$ and $\hat{\mathbf{d}}'$ are supported in $\mathcal{G}$ by hypothesis, and their restrictions to $\{0\} \cup \mathcal{G}$ are equal to $\hat{\mathbf{u}}$, it follows readily that $\hat{\mathbf{d}} = \hat{\mathbf{d}}'$.
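The uniqueness guaranteed by strict convexity can be illustrated with a small solver. The sketch below is not the block-coordinate descent algorithm of the article but a simpler proximal-gradient stand-in using the multidimensional soft-thresholding operator, on hypothetical random data with a full-column-rank design: two different initializations converge to the same minimizer.

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Multidimensional shrinkage: scales v toward 0; exactly 0 if ||v||_2 <= tau."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= tau else (1.0 - tau / nrm) * v

def group_lasso(X, y, groups, lam, n_iter=5000, u0=None):
    """Proximal-gradient solver for 0.5*||y - X u||^2 + lam * sum_g ||u_g||_2."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
    u = np.zeros(X.shape[1]) if u0 is None else u0.copy()
    for _ in range(n_iter):
        z = u - step * (X.T @ (X @ u - y))       # gradient step on the LS term
        u = np.concatenate([group_soft_threshold(z[g], step * lam) for g in groups])
    return u

rng = np.random.default_rng(0)
M, L, G = 60, 3, 4                 # samples, group size, number of groups (hypothetical)
X = rng.standard_normal((M, G * L))  # full column rank almost surely
y = rng.standard_normal(M)
groups = [range(g * L, (g + 1) * L) for g in range(G)]

# Strict convexity (cf. Proposition 3) implies a unique minimizer, so runs
# started from different points must agree
u_a = group_lasso(X, y, groups, lam=2.0)
u_b = group_lasso(X, y, groups, lam=2.0, u0=rng.standard_normal(G * L))
print(np.max(np.abs(u_a - u_b)))
```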
Declarations
Acknowledgements
This work was supported by the MURI (AFOSR FA9550-10-1-0567) grant.
Authors’ Affiliations
References
1. Stoica P, Moses RL: Introduction to Spectral Analysis. Prentice-Hall, New Jersey; 1997.
2. Djuric PM: A MAP solution to off-line segmentation of signals. In Proc of the International Conference on Acoustics, Speech, and Signal Processing. Volume 4. Adelaide, Australia; 1994:505-508.
3. Dobigeon N, Tourneret JY, Davy M: Joint segmentation of piecewise constant autoregressive processes by using a hierarchical model and a Bayesian sampling approach. IEEE Trans Signal Process 2007, 55(4):1251-1263.
4. Fearnhead P: Exact Bayesian curve fitting and signal segmentation. IEEE Trans Signal Process 2005, 53(6):2160-2166.
5. Punskaya E, Andrieu C, Doucet A, Fitzgerald WJ: Bayesian curve fitting using MCMC with applications to signal segmentation. IEEE Trans Signal Process 2002, 50(3):747-758.
6. Lavielle M: Optimal segmentation of random processes. IEEE Trans Signal Process 1998, 46(5):1365-1373.
7. Basseville M, Nikiforov IV: Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Englewood Cliffs, NJ, USA; 1993.
8. Hall MG, Oppenheim AV, Willsky AS: Time-varying parametric modeling of speech. Signal Process 1983, 5(3):267-285.
9. Rudoy D, Quatieri TF, Wolfe PJ: Time-varying autoregressive tests for multiscale speech analysis. In Proceedings of Interspeech. Volume 1. Brighton, UK; 2009:2839-2842.
10. Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. J Royal Stat Soc Ser B 2006, 68(1):49-67.
11. Brockwell PJ, Davis RA: Time Series: Theory and Methods. Springer-Verlag, New York, NY, USA; 1990.
12. Boysen L, Kempe A, Liebscher V, Munk A, Wittich O: Consistencies and rates of convergence of jump-penalized least-squares estimators. Annals Stat 2009, 37(1):157-183.
13. Lavielle M: Using penalized contrasts for the change-point problem. Signal Process 2005, 85(8):1501-1510.
14. Lavielle M, Moulines E: Least-squares estimation of an unknown number of shifts in a time series. J Time Series Anal 2000, 21(1):33-59.
15. Lebarbier E: Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Process 2005, 85(4):717-736.
16. Winkler G, Liebscher V: Smoothers for discontinuous signals. J Nonparametric Stat 2002, 14(1-2):203-222.
17. Harchaoui Z, Levy-Leduc C: Catching change-points with Lasso. In Proceedings of Advances in Neural Information Processing Systems. Volume 20. Vancouver, Canada; 2008:161-168.
18. Appel U, Brandt AV: Adaptive sequential segmentation of piecewise stationary time series. Inf Sci 1983, 29(1):27-56.
19. Willsky A, Jones H: A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans Autom Control 1976, 21(1):108-112.
20. Eldar YC, Mishali M: Block sparsity and sampling over a union of subspaces. In Proc of the 16th International Conference on Digital Signal Processing. Volume 1. Santorini, Greece; 2009:1-8.
21. Tibshirani R: Regression shrinkage and selection via the Lasso. J Royal Stat Soc Ser B 1996, 58(1):267-288.
22. Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001, 96:1348-1360.
23. Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Annals Stat 2004, 32:407-499.
24. Sturm JF: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim Methods Softw 1999, 11-12:625-653.
25. Puig A, Wiesel A, Hero A: A multidimensional shrinkage-thresholding operator. In Proceedings of the 15th Workshop on Statistical Signal Processing. Cardiff, UK; 2009:363-366.
26. Tseng P: Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 2001, 109(3):475-494.
27. Eldar YC, Kuppinger P, Bölcskei H: Block-sparse signals: uncertainty relations and efficient recovery. IEEE Trans Signal Process 2010, 58:3042-3054.
28. Bach FR: Consistency of the group Lasso and multiple kernel learning. J Mach Learn Res 2008, 9:1179-1225.
29. Nardi Y, Rinaldo A: On the asymptotic properties of the group Lasso estimator for linear models. Electron J Stat 2008, 2:605-633.
30. Fuchs JJ: On sparse representations in arbitrary redundant bases. IEEE Trans Inf Theory 2004, 50(6):1341-1344.
31. Gorodnitsky I, Rao B: Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Trans Signal Process 1997, 45(3):600-616.
32. Tropp J: Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans Inf Theory 2006, 52(3):1030-1051.
33. Candes EJ, Wakin MB, Boyd S: Enhancing sparsity by reweighted ℓ_1 minimization. J Fourier Anal Appl 2008, 14(5):877-905.
34. Zou H, Li R: One-step sparse estimates in nonconcave penalized likelihood models. Annals Stat 2008, 36:1509-1533.
35. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K: Sparsity and smoothness via the fused Lasso. J Royal Stat Soc Ser B 2005, 67(1):91-108.
36. Ruszczynski A: Nonlinear Optimization. Princeton University Press, Princeton, NJ, USA; 2006.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.