 Research
 Open access
Maximum likelihood inference for a class of discrete-time Markov switching time series models with multiple delays
EURASIP Journal on Advances in Signal Processing volume 2024, Article number: 74 (2024)
Abstract
Autoregressive Markov switching (ARMS) time series models are used to represent real-world signals whose dynamics may change over time. They have found application in many areas of the natural and social sciences, as well as in engineering. In general, inference for this kind of system involves two problems: (a) detecting the number of distinct dynamical models that the signal may adopt and (b) estimating any unknown parameters in these models. In this paper, we introduce a new class of nonlinear ARMS time series models with delays that includes, among others, many systems resulting from the discretisation of stochastic delay differential equations (DDEs). Remarkably, this class includes cases in which the discretisation time grid is not necessarily aligned with the delays of the DDE, resulting in discrete-time ARMS models with real (non-integer) delays. The incorporation of real, possibly long, delays is a key departure compared to typical ARMS models in the literature. We describe methods for the maximum likelihood detection of the number of dynamical modes and the estimation of unknown parameters (including the possibly non-integer delays) and illustrate their application with a nonlinear ARMS model of the El Niño–Southern Oscillation (ENSO) phenomenon.
1 Introduction
1.1 Background
Discrete-time autoregressive Markov switching (ARMS) time series models [15] are used to represent real-world signals whose dynamics may change over time. For example, if \(\{x_n\}_{n\ge 0}\) is the signal of interest, we may model its evolution as
where \(n \ge 0\) is the current time, \(x_{0:n-1}=\{ x_0, x_1, \ldots , x_{n-1} \}\) is the signal history, \(\{u_n\}_{n\ge 0}\) is some noise (random) process, \(\alpha\) is a vector of parameters and \(\{l_n\}_{n\ge 0}\) is a Markov chain [40], i.e. a sequence of discrete random indices that change over time according to a Markov kernel that describes the conditional probabilities \(\text {Prob}\left( l_n=i \mid l_{n-1}=j \right)\) for suitable integers i and j. For each distinct value \(l_n\) we have a different map \(\psi [l_n](\cdot ,\cdot ,\cdot )\) and, hence, the dynamics of \(x_n\) change with the evolution of the Markov chain. Also, different functions \(\psi [i]\) and \(\psi [j]\) (\(i \ne j\)) may depend on different parameters in \(\alpha\). ARMS models have found a plethora of applications in statistical signal processing for econometrics [8, 9, 13, 19, 43], engineering [16, 17, 28, 39], the geosciences [1, 25, 32], or complex networks [24], to name a few examples.
Inference for ARMS models involves

(a)
the detection of the number of possible values that the Markov chain \(\{l_n\}_{n\ge 0}\) can take,

(b)
and the estimation of any unknown parameters contained in \(\alpha\).
We may assume, without loss of generality, that \(l_n \in \{ 1, \ldots , L \}\). Following [24], in this paper we refer to each map \(\psi [l]\), \(1 \le l \le L\), as a layer and, hence, task (a) consists in estimating the number of active dynamical layers L from a sequence of data samples \(x_{0:n}\). This is a model order selection problem that can be tackled using the Akaike and Bayesian information criteria (AIC and BIC, respectively) [25, 26, 38, 41], penalised distances [27], penalised likelihoods [32, 34], the three-pattern method [38] or the Hannan–Quinn criterion (HQC) [38]. Linear ARMS models admit an equivalent representation as autoregressive moving-average (ARMA) systems and, in this case, the number of layers L can also be inferred from the covariance matrix of the ARMA process [9, 48].
As for problem (b), maximum likelihood (ML) and maximum a posteriori estimators can be approximated using different forms of the expectation–maximisation (EM) algorithm [1, 15, 32], while Markov chain Monte Carlo (MCMC) methods have been applied for Bayesian estimation [8, 24, 28, 43, 49]. Simpler moment matching techniques can also be applied for parameter estimation in linear ARMS systems [20].
See [15, 36] for a survey of recent ARMS models and methods.
1.2 Contributions
In this paper we investigate nonlinear ARMS models where the independent noise process \(\{u_n\}_{n\ge 0}\) is additive and each dynamical layer depends on a different delay of the signal. Therefore, we particularise Eq. (1) by assuming that the \(l_n\)th nonlinear map can be written as \(\psi [l_n](x_{0:n-1},u_n,\alpha ) = \phi [l_n](x_{n-1}, x_{n-D[l_n]}, \alpha ) + u_n[l_n],\) where \(\{l_n\}_{n\ge 0}\) is a homogeneous Markov chain, each \(D[l] > 1\) (\(1 \le l \le L\)) is a (possibly long) delay, specific to the lth dynamical layer, \(\phi [l_n]\) is a nonlinear function and \(u_n[l_n]\) is a (layer-specific) noise process. The time series model (1) hence becomes
A detailed description of the relevant family of models is given in Sect. 2. Our formulation is devised to target time series models that result from the discretisation of delay differential equations (DDEs) [4, 7], which appear often in geophysics [22, 46].
The specific contributions of this work can be summarised as follows:

We introduce an ARMS time series model that includes systems resulting from the discretisation of stochastic DDEs. In particular, the proposed model includes cases in which the times at which the signal can be observed are not necessarily aligned with the relevant delays (which are often unknown), resulting in discrete-time models with real (non-integer) delays.

We propose a stability criterion for the new nonlinear ARMS models with multiple delays. In particular, we describe sufficient conditions on the dynamics of the constituent layers that ensure the existence of finite bounds for the moments of the ARMS model up to a given order.

We provide an EM framework for parameter estimation, based on space alternation and a simple stochastic optimisation algorithm, that can be systematically implemented for the models of the proposed class. This scheme can be easily combined with an ML detector of the number L of dynamical layers.

We illustrate the application of the proposed model and inference methodology by discretising and then fitting a stochastic DDE which has been proposed as a representation of the El Niño–Southern Oscillation (ENSO) [46]. We obtain numerical results for models with up to four dynamical layers and either integer or real delays. In this example, non-integer delays appear naturally when the observation times are not aligned with the physical delays. We validate the model and estimation algorithm using synthetically generated observations and then apply the methodology to real ENSO data. The prediction accuracy of the proposed nonlinear delayed ARMS model is compared with several deep learning-based time series forecasting models.
1.3 Notation
Hereafter, scalar magnitudes are denoted by regular-face letters, e.g. x. Column vectors and matrices are represented by boldface letters, either lower case or upper case, respectively. For example, \(\varvec{x} = \left( x_1, \ldots , x_m \right) ^\top\) is an \(m \times 1\) vector (the superscript \(^\top\) denotes transposition) and \(\varvec{X} = \left( \varvec{x}_1, \ldots , \varvec{x}_d \right)\) is an \(m \times d\) matrix, with \(\varvec{x}_i = \left( x_{1i}, \ldots , x_{mi} \right) ^\top\) denoting its ith column. Discrete time is indicated as a subscript, e.g. \(\varvec{x}_n\). Dependences on an integer index other than time are represented with the index between square brackets, e.g. D[l] in (2) is the delay associated with the lth dynamical layer.
We often abide by a simplified notation for probability functions, where p(x) denotes the probability density function (pdf) of the random variable (r.v.) x. This notation is argument-wise, hence if we have two r.v.’s x and y, then p(x) and p(y) denote the corresponding density functions, possibly different; p(x, y) denotes the joint pdf and \(p(x\mid y)\) is the conditional pdf of x given y. The notation for multidimensional r.v.’s, e.g. \(\varvec{x}\) and \(\varvec{y}\), is analogous, i.e. \(p(\varvec{x},\varvec{y})\), \(p(\varvec{x}\mid \varvec{y})\), etc. The probability mass function (pmf) of a discrete r.v. x is denoted P(x) (note the upper case) and we follow the same argument-wise convention as for pdf’s.
1.4 Contents
The rest of the paper is organised as follows. In Sect. 2, we introduce the new class of discretetime, nonlinear ARMS models with multiple delays and introduce a stability criterion. An EM framework for inference is described in Sect. 3, and computer simulation examples are presented in Sect. 4. A case study with real ENSO sea surface temperature anomalies is presented in Sect. 5. Section 6 is devoted to the conclusions.
2 Time series models
2.1 Delayed nonlinear ARMS time series models
We introduce a nonlinear ARMS time series model with L layers, i.e. L different dynamical modes, each one induced by a different nonlinear map and a different integer delay. An extended model with noninteger delays is described in Sect. 2.2.
Let \(\{\varvec{x}_n\}_{n\ge 0}\) be a random sequence taking values in \(\mathbb {R}^d\), and let \(\{l_n\}_{n\ge 0}\) be a homogeneous Markov chain, taking values on the finite set \(\mathcal {L}= \{1, \ldots , L\}\), with \(L \times L\) transition matrix denoted as \(\varvec{M}\) and initial pmf \(P_0:\mathcal {L}\mapsto [0,1]\). The entry in the ith row and jth column of \(\varvec{M}\), denoted \(M_{ij}\), is the probability mass \(P(l_n=j \mid l_{n-1}=i)\). The delayed nonlinear ARMS model with L layers and initial value \(\varvec{x}_0\), denoted DNARMS(L), is constructed as
where \(\boldsymbol{\alpha } = (\alpha _1, \ldots , \alpha _k )^\top\) is a \(k \times 1\) vector of real model parameters, \(\Lambda _D = \{ D[1], \ldots , D[L] \}\) is a set of positive integer delays, one per dynamical layer, the functions \(\phi [1], \ldots , \phi [L]\) are distinct \(\mathbb {R}^d \times \mathbb {R}^d \times \mathbb {R}^k \mapsto \mathbb {R}^d\) nonlinear maps, and \(\{\varvec{u}_n[1]\}_{n\ge 1}, \ldots , \{\varvec{u}_n[L]\}_{n \ge 1}\) are independent sequences of \(d\times 1\) i.i.d. random vectors with layer-dependent distinct pdf’s \(p_l(\varvec{u})\), \(l=1, \ldots , L\). The model description is complete with a prior pdf \(p(\varvec{x}_{-D^+:-1})\), where \(D^+ = \max _{l \in \mathcal {L}} D[l]\) is the maximum delay and \(\varvec{x}_{i:j}\) denotes the subsequence \(\varvec{x}_i, \varvec{x}_{i+1}, \ldots , \varvec{x}_j\). Note that a distribution for the r.v. \(\varvec{x}_0\) is not sufficient to initialise the model because of the delays D[l] (it is often necessary to assume some signal values for \(n<0\)).
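As an illustration, the generative mechanism of a scalar (\(d=1\)) DNARMS(L) model can be sketched in a few lines. The function below is our own minimal sketch, not code from the paper; layers are 0-indexed here, unlike the 1-indexed notation in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dnarms(phi, delays, M, P0, noise_std, x_init, T):
    """Simulate a scalar DNARMS(L) model in the spirit of Eq. (3).

    phi       : list of L maps phi[l](x_prev, x_delayed) -> float
    delays    : list of L positive integer delays D[l]
    M         : L x L transition matrix, M[i, j] = P(l_n = j | l_{n-1} = i)
    P0        : initial pmf of the Markov chain
    noise_std : list of L layer-specific Gaussian noise standard deviations
    x_init    : prior signal values x_{-D^+}, ..., x_0 (length max(delays) + 1)
    T         : number of steps to simulate
    """
    L = len(phi)
    Dmax = max(delays)
    x = list(x_init)                       # x[Dmax + n] stores x_n
    l = rng.choice(L, p=P0)                # draw l_0 from the initial pmf
    layers = [l]
    for n in range(1, T + 1):
        l = rng.choice(L, p=M[l])          # Markov switch of the active layer
        x_prev = x[Dmax + n - 1]           # x_{n-1}
        x_del = x[Dmax + n - delays[l]]    # x_{n-D[l_n]}
        x.append(phi[l](x_prev, x_del) + noise_std[l] * rng.normal())
        layers.append(l)
    return np.array(x[Dmax:]), np.array(layers)
```

For example, two contractive layers with delays 1 and 3 can be simulated as `simulate_dnarms([lambda a, b: 0.5 * a, lambda a, b: 0.3 * b], [1, 3], M, P0, [0.1, 0.1], [0.0, 0.0, 0.0, 1.0], 100)`.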
2.2 Continuous-delay Markov-switching nonlinear models
Model (3) can be extended to incorporate real positive delays. Non-integer delays arise, for example, from the discretisation of stochastic DDEs when the continuous-time delays are not aligned with the time grid of the observed series. Such scenarios are quite natural. For example, in Sect. 4.1 we look into models of ENSO temperature anomalies. These temperatures are typically collected on a monthly basis; however, there is no physical reason for the delays in the DDE models to be an integer number of months.
Assume that \(D[l] \in [1,+\infty )\) for all \(l \in \mathcal {L}\). Model (3) can be extended to account for possibly noninteger delays if we rewrite it as
where \(\tilde{\varvec{x}}_{n-D[l]}\) is constructed as an interpolation of consecutive elements of the series \(\{\varvec{x}_n\}_{n\ge 0}\). In general, for \(\tau \in \mathbb {R}^+\), we let \(\tilde{\varvec{x}}_{\tau } = \sum _{m=0}^\infty \kappa (\tau - m) \varvec{x}_m,\) where \(\kappa : \mathbb {R}\mapsto \mathbb {R}\) is an interpolation kernel satisfying \(\tilde{\varvec{x}}_\tau = \varvec{x}_\tau\) when \(\tau\) is an integer. In the computer experiments of Sects. 4 and 5 we restrict our attention, for simplicity, to the order-1 interpolation
where, for a real number \(\tau \in \mathbb {R}\), \(\lfloor \tau \rfloor = \sup \{n \in \mathbb {Z}: n<\tau \}\).
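A minimal sketch of this order-1 (linear) interpolation, using the strict-inequality floor defined above so that \(\tilde{\varvec{x}}_\tau = \varvec{x}_\tau\) at integer \(\tau\) (the function name is ours, for illustration):

```python
import math

def interp_order1(x, tau):
    """Order-1 (linear) interpolation of a sampled sequence x at real tau >= 1.

    Uses floor(tau) = sup{n in Z : n < tau}, as in the text, so that the
    interpolated value coincides with x[tau] exactly when tau is an integer.
    """
    n = math.ceil(tau) - 1          # sup{n in Z : n < tau}
    w = tau - n                     # weight of the later sample, in (0, 1]
    return (1.0 - w) * x[n] + w * x[n + 1]
```

For instance, with `x = [0.0, 1.0, 4.0, 9.0]`, `interp_order1(x, 2.5)` returns the midpoint 6.5 of `x[2]` and `x[3]`, while `interp_order1(x, 2.0)` recovers `x[2]` exactly.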
Let us remark that \(\tilde{\varvec{x}}_{n-D[l_n]}\) in Eq. (4) is not an observed data point; however, it can be deterministically computed from observed data. Also, model (4) reduces to model (3) when \(D[1], \ldots , D[L]\) are all integers. We refer to model (4), with real delays, as cDNARMS(L).
2.3 Stability analysis
2.3.1 Ergodicity of nonlinear ARMS models
While several authors have analysed the properties of specific ARMS-type models [2, 5, 15, 44], general results for discrete-time nonlinear ARMS processes are scarce. Perhaps the most relevant reference is [47], where Yao and Attali provide sufficient conditions for nonlinear, first-order ARMS models to have an invariant distribution with finite moments. Their analysis deals with systems with no delays or, equivalently in our notation, with the case where \(D[1]=\cdots =D[L]=1\). As a consequence, the models they analyse can be written as
with known parameter vector \(\boldsymbol{\alpha }\).
A key assumption in the analysis of [47] is that the functions \(\phi [l]\), \(l=1, \ldots , L\), need to be either sublinear or Lipschitz with sufficiently small constants. In particular, if the functions are Lipschitz, i.e. they satisfy \(\Vert \phi [l](\varvec{x},\boldsymbol{\alpha }) - \phi [l](\varvec{x}',\boldsymbol{\alpha })\Vert \le c[l] \Vert \varvec{x} - \varvec{x}' \Vert\) for some constants \(c[l]<\infty\), \(l=1, \ldots , L\), then the convergence theorems in [47] rely on the assumption
where \(P_\infty\) is the limit distribution of the homogeneous Markov chain \(\{l_n\}_{n\ge 0}\) with transition matrix \(\varvec{M}\). Unfortunately, the inequality (6) does not hold when the functions \(\phi [l]\) result from the discretisation of differential equations, which is the main focus of the cDNARMS(L) model class.
For example, assume that the lth layer of an ARMS model is derived from a simple stochastic differential equation [35] of the form
where t denotes continuous time and \(\varvec{w}(t)\) is a standard multivariate Wiener process. The Euler–Maruyama time discretisation of (7) with time step \(h>0\) yields
where \(\varvec{z}_n\) is a sequence of i.i.d. standard Gaussian random vectors. A simple inspection of (8) shows that the corresponding lth function in the ARMS model becomes
If f[l] is Lipschitz with constant A[l], then \(\phi [l]\) is Lipschitz with constant \(c[l] = 1 + hA[l]>1\). Since this is true for \(l=1, \ldots , L\), then \(\sum _{l=1}^L P_\infty (l)\log \left( c[l]\right) >0\), which violates the assumptions in the analysis of [47]. The same issue arises in the case of sublinear functions.
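This observation is easy to verify numerically. The snippet below (our own illustration, not code from [47]) estimates the Lipschitz constant of a scalar Euler–Maruyama drift map \(\phi(x) = x + h f(x)\) by finite differences, for a drift with bounded slope:

```python
import numpy as np

h, A = 0.1, 2.0
f = lambda x: A * np.sin(x)       # Lipschitz drift with constant A
phi = lambda x: x + h * f(x)      # one Euler-Maruyama drift step

# Numerically estimate the Lipschitz constant of phi on a fine grid:
x = np.linspace(-10.0, 10.0, 20001)
lip = np.max(np.abs(np.diff(phi(x)) / np.diff(x)))
print(lip)   # approx. 1 + h * A = 1.2 > 1, so condition (6) fails
```

The estimate matches the bound \(c = 1 + hA\) from the text: even a well-behaved drift yields a map with Lipschitz constant strictly greater than 1 for any \(h > 0\).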
2.3.2 Stability of cDNARMS(L) models
To avoid the difficulties with the method of [47], we seek a different characterisation of the stability of cDNARMS(L) processes. In particular, we aim at finding conditions that guarantee that the random sequence \(\{\varvec{x}_n\}_{n\ge 0}\) generated by a cDNARMS(L) model has finite moments of order \(q \ge 1\) for all n. To be specific, let \(\Vert \varvec{v} \Vert\) denote the Euclidean norm of a vector \(\varvec{v}\) and let \(\mathbb {E}_{\varvec{x}}[ \cdot ]\) denote expectation w.r.t. the random vector \(\varvec{x}\) in the subscript. We seek sufficient conditions to ensure \(\sup _n \mathbb {E}_{\varvec{x}_n}[ \Vert \varvec{x}_n \Vert ^q ]<\infty\) for some \(q \ge 1\).
Let \(\{\pi _n\}_{n \ge 0}\) denote a specific family of conditional pdf’s and let \(\varvec{y}_0 \sim \pi _0(\varvec{y}_0)\) and \(\varvec{y}_n \sim \pi _n(\varvec{y}_n\mid \varvec{y}_{0:n-1})\) be the vector random sequence generated by \(\{\pi _n\}_{n \ge 0}\). If we choose an integer \(n_0 \ge 0\) and an arbitrary collection of random vectors \(\bar{\varvec{y}}_{0:n_0}\), then we can construct a new sequence \(\{ \bar{\varvec{y}}_n \}_{n \ge 0}\) where, for \(n > n_0\), the random vectors \(\bar{\varvec{y}}_n\) are generated by the same conditional pdf’s \(\{\pi _{n}\}_{n > n_0}\), i.e. \(\bar{\varvec{y}}_n \sim \pi _n(\bar{\varvec{y}}_n\mid \bar{\varvec{y}}_{0:n-1})\). Let us refer to the new sequence \(\{\bar{\varvec{y}}_n\}_{n \ge 0}\) as a patched version of the original \(\{\varvec{y}_n\}_{n \ge 0}\), where the patch is the initial subsequence \(\bar{\varvec{y}}_{0:n_0}\). We now introduce a notion of stability that is slightly stronger than just requiring bounded moments.
Definition 1
Let \(\{\varvec{y}_n\}_{n \ge 1}\) be a sequence of random vectors such that
for some \(q \ge 1\) and some constant \(c_\infty ^q < \infty\). The sequence \(\{\varvec{y}_n\}_{n \ge 1}\) is q-stable if, and only if, every patched version \(\{ \bar{\varvec{y}}_n \}_{n \ge 0}\) satisfies
Notation \(a \vee b\) denotes the maximum between a and b. Intuitively, a random sequence is q-stable when

(a)
it has finite moments of order q, and

(b)
these moments remain bounded when we force an arbitrary initialisation, as long as these initial vectors (the patch) have bounded moments as well.
The notion of q-stability applies in a straightforward way to the class of cDNARMS(L) models given by Eq. (4) and a Markov chain \(\{l_n\}_{n\ge 0}\). To see this, let us denote as
the random sequences independently generated by each one of the L layers alone. We can state the following stability result.
Proposition 1
If \(\mathbb {E}_{\varvec{x}_0}\left[ \Vert \varvec{x}_0 \Vert ^q \right] <\infty\) and the random sequences \(\{ \varvec{x}_n^{(l)} \}_{n\ge 0}\), \(l=1, \ldots , L\), are q-stable, then the corresponding cDNARMS(L) model with Markov chain \(\{l_n\}_{n\ge 0}\) taking values in \(\{1, \ldots , L\}\) is q-stable as well.
A detailed proof is presented in Appendix A. Note that Proposition 1 does not require the Markov chain \(\{l_n\}_{n\ge 0}\) to be homogeneous. Also, the condition that all sequences \(\{ \varvec{x}_n^{(l)} \}_{n\ge 0}\) be q-stable is sufficient, but we conjecture that it is not necessary. It seems clear that the model may include layers which are not q-stable and yet, if these layers are visited with low probability and/or the Markov chain \(\{l_n\}_{n\ge 0}\) dwells on them for very short periods of time, the overall sequence \(\{\varvec{x}_n\}_{n\ge 0}\) may still be q-stable. An extended analysis is left for future work.
3 Model inference
We introduce a space-alternating (SA) EM algorithm for iterative ML parameter estimation in the general cDNARMS(L) model described in Sect. 2.2. First, we obtain the likelihood for the proposed class of models, recall the standard EM method and explain why it is not tractable. We then describe the SAEM scheme and conclude this section with a succinct discussion on the ML detection of the number L of dynamical layers.
3.1 Likelihood function
Let \(\varvec{x}_{0:T}=\{ \varvec{x}_0, \ldots , \varvec{x}_T \}\) be a sequence of observations (and assume that \(\varvec{x}_n\) is given for all \(n<0\)). The set of model parameters to be estimated is \(\Lambda = \Lambda _M \cup \Lambda _D \cup \Lambda _\alpha\), where
are the set of entries of the \(L\times L\) transition matrix \(\varvec{M}\), the set of (possibly real) delays and the set of entries of the \(k\times 1\) vector \(\boldsymbol{\alpha }\), respectively.
We denote the likelihood of the parameter set \(\Lambda\) given the observed sequence \(\varvec{x}_{0:T}\) as \(p(\varvec{x}_{0:T} \mid \Lambda )\). In order to obtain an explicit expression for the likelihood we write it in terms of the joint distribution of \(\varvec{x}_{0:T}\) and the Markov sequence of layers \(l_{0:T} = \{ l_0, \ldots , l_T \}\), namely
For any \(0 < n \le T\), we can use the chain rule to obtain a recursive decomposition of the joint distribution,
where we have used the Markov property and the fact that \(l_n\) is conditionally independent of \(\varvec{x}_{0:n-1}\) to show that \(p(l_n\mid l_{0:n-1},\varvec{x}_{0:n-1},\Lambda ) = M_{l_{n-1},l_n}\) and \(p(\varvec{x}_n \mid \varvec{x}_{0:n-1},l_{0:n},\Lambda ) = p(\varvec{x}_n \mid \varvec{x}_{0:n-1},l_n,\Lambda )\). If we (repeatedly) substitute the recursive relationship (10) in Eq. (9), we readily obtain
where \(P_0(l_0)\) is the initial pmf of the Markov chain. All factors in the expression above can be computed. In particular, note that \(p(\varvec{x}_0\mid l_0,\Lambda )\) is tractable because we have assumed that \(\varvec{x}_{-1}, \varvec{x}_{-2}, \ldots\) are known.
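The factorised likelihood (11) can be evaluated with a standard forward recursion over the layer index. The sketch below is our own illustrative implementation, in the log domain for numerical stability; it assumes the per-step emission terms \(\log p(\varvec{x}_n \mid \varvec{x}_{0:n-1}, l_n, \Lambda)\) have been precomputed from the layer maps and noise pdf's:

```python
import numpy as np

def forward_loglik(log_emission, M, P0):
    """Forward recursion for log p(x_{0:T} | Lambda).

    log_emission[n, l] = log p(x_n | x_{0:n-1}, l_n = l, Lambda), precomputed.
    M is the L x L transition matrix, P0 the initial pmf of the Markov chain.
    Runs in O(T L^2) time.
    """
    T1, L = log_emission.shape
    log_alpha = np.log(P0) + log_emission[0]
    for n in range(1, T1):
        # log-sum-exp of the prediction step, for numerical stability
        m = log_alpha.max()
        pred = np.log(np.exp(log_alpha - m) @ M) + m
        log_alpha = pred + log_emission[n]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```

With a single layer (`L = 1`, `M = [[1.0]]`) the recursion collapses, as expected, to the sum of the per-step log-emission terms.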
An ML estimator of the parameters is a solution of the problem \(\Lambda _{ML} \in \arg \max _{\Lambda } p(\varvec{x}_{0:T}\mid \Lambda )\). However, even when the \(\phi [l]\)’s are linear, it is not possible to compute \(\Lambda _{ML}\) exactly and we need to resort to numerical approximations [15].
3.2 Expectationmaximisation algorithm
Let x and y be r.v.’s, let \(\theta\) be some parameter and let f(x) be some transformation of x. We write \(\mathbb {E}_{x\mid y,\theta }[ f(x) ]\) to denote the expected value of f(x) w.r.t. the distribution with pdf \(p(x\mid y,\theta )\), i.e. \(\mathbb {E}_{x\mid y,\theta }[ f(x) ] = \int f(x) p(x\mid y,\theta ) \textsf{d}x.\)
A standard EM algorithm [11, 31] for the iterative ML estimation of \(\Lambda\) from the data sequence \(\varvec{x}_{0:T}\) can be outlined as follows.

1.
Initialisation: choose an initial (arbitrary) estimate \({\hat{\Lambda }}_0\).

2.
Expectation step: given an estimate \({\hat{\Lambda }}_i\), obtain the expectation
$$\begin{aligned} \mathcal {E}_i(\Lambda ) = \mathbb {E}_{l_{0:T}\mid \varvec{x}_{0:T}, {\hat{\Lambda }}_i}\left[ \log \left( p(\varvec{x}_{0:T},l_{0:T}\mid \Lambda ) \right) \right] . \end{aligned}$$(12) 
3.
Maximisation step: obtain a new estimate,
$$\begin{aligned} {\hat{\Lambda }}_{i+1} \in \arg \max _{\Lambda } \mathcal {E}_i(\Lambda ). \end{aligned}$$(13)
Using standard terminology, \(\varvec{x}_{0:T}\) is the observed data, \(l_{0:T}\) is the latent data and \(\{ \varvec{x}_{0:T},l_{0:T} \}\) is the complete dataset. The basic guarantee provided by the EM algorithm is that the estimates \({\hat{\Lambda }}_0, {\hat{\Lambda }}_1, \ldots\) have non-decreasing likelihoods, i.e. for every \(i \ge 0\), \(p(\varvec{x}_{0:T}\mid {\hat{\Lambda }}_{i+1}) \ge p(\varvec{x}_{0:T}\mid {\hat{\Lambda }}_i)\) [31].
If we substitute the decomposition (11) into the expectation of (12) we arrive at
where
and
Now, if we write the posterior expectations \(\mathbb {E}_{l_n\mid \varvec{x}_{0:T},{\hat{\Lambda }}_i}[\cdot ]\) and \(\mathbb {E}_{l_{n-1:n}\mid \varvec{x}_{0:T},{\hat{\Lambda }}_i}[\cdot ]\) explicitly we obtain
where \(P(l_n\mid \varvec{x}_{0:n},{\hat{\Lambda }}_i)\) and \(P(l_{n-1},l_n\mid \varvec{x}_{0:n},{\hat{\Lambda }}_i)\) are the posterior pmf’s of the random indices \(l_n\) and \(l_{n-1:n}\), respectively, conditional on the observations \(\varvec{x}_{0:n}\) and the ith parameter estimates \({\hat{\Lambda }}_i\).
The posterior pmf’s \(P(l_n\mid \varvec{x}_{0:n},{\hat{\Lambda }}_i)\) and \(P(l_{n-1},l_n\mid \varvec{x}_{0:n},{\hat{\Lambda }}_i)\) can be computed exactly, for every \(n=0, \ldots , T\), by running a forward–backward algorithm [15, 34]. Given these probabilities, from the sequence of Eqs. (13), (14), (15) and (16) it is apparent that we can update
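For instance, the transition-matrix update referenced above admits a closed form: the familiar Baum–Welch-style row normalisation of expected transition counts. A minimal sketch (variable names are ours, for illustration):

```python
import numpy as np

def update_transition_matrix(xi):
    """Exact update of the transition matrix from the pairwise posteriors
    xi[n, i, j] = P(l_{n-1} = i, l_n = j | x, Lambda_i), as delivered by a
    forward-backward recursion (a standard Baum-Welch-style update).
    """
    counts = xi.sum(axis=0)                            # expected transition counts
    return counts / counts.sum(axis=1, keepdims=True)  # row-normalise
```

Each row of the returned matrix sums to one, so the update always produces a valid Markov kernel.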
Unfortunately, the update of the delays (in the set \({{\hat{\Lambda }}}_{D,i}\)) and the model parameters \(\boldsymbol{\alpha }\) (in the set \({{\hat{\Lambda }}}_{\alpha ,i}\)) cannot be carried out analytically, i.e. the problem
is intractable in general. Moreover, the number of parameters in \(\Lambda _D \cup \Lambda _\alpha\) is typically large and they may be subject to constraints (e.g. positive parameters, or parameters contained in a certain interval), which makes numerical optimisation hard in practice. As a workaround, we propose a SAEM algorithm [14] that can be used systematically for the class of models described by (4).
3.3 Spacealternating expectationmaximisation algorithm
The intuitive idea is to split the parameters into three sets: one containing the entries of matrix \(\varvec{M}\) (since we can solve (17) exactly), another one containing the parameters that admit an exact update and a third set containing the parameters for which the update has to be done approximately, using numerical optimisation algorithms. Then, it is possible to cycle through these three sets, updating one or more parameters at a time, while all others are kept fixed. This approach is often termed ‘space-alternating’. It makes algorithm design simpler and still guarantees that the likelihood of the sequence of generated estimates is non-decreasing (as in the standard EM method).
To be specific, let us repartition the parameter set as \(\Lambda = \Lambda _M \cup \Lambda _* \cup \Lambda _c\), where \(\Lambda _M\), \(\Lambda _*\) and \(\Lambda _c\) are disjoint sets, and

\(\Lambda _M\) contains the \(L \times L\) entries of the transition matrix \(\varvec{M}\), as before;

\(\Lambda _* \subseteq \Lambda _D \cup \Lambda _\alpha\) contains the parameters that can be updated exactly (when all others are kept fixed);

and \(\Lambda _c = \left( \Lambda _D \cup \Lambda _\alpha \right) \backslash \Lambda _*\) contains the remaining parameters, which have to be updated approximately, using numerical optimisation.
With this partition, we ensure the ability to solve the problem
exactly (note that \(\Lambda _* \cup \Lambda _c = \Lambda _D \cup \Lambda _\alpha\), hence \(\mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i})\) is well defined).
We can now summarise the proposed SAEM algorithm as follows.

1.
Initialisation: choose an initial (arbitrary) estimate \({\hat{\Lambda }}_0 = {\hat{\Lambda }}_{M,0} \cup {\hat{\Lambda }}_{*,0} \cup {\hat{\Lambda }}_{c,0} = {\hat{\Lambda }}_{M,0} \cup {\hat{\Lambda }}_{D,0} \cup {\hat{\Lambda }}_{\alpha ,0}\).

2.
Expectation step: given an estimate \({\hat{\Lambda }}_i\), run a forward–backward algorithm to compute the pmf’s \(P(l_n\mid \varvec{x}_{0:n},{\hat{\Lambda }}_i)\) for all \(l_n \in \mathcal {L}\) and \(P(l_{n-1:n}\mid \varvec{x}_{0:n},{\hat{\Lambda }}_i)\) for all \(l_{n-1:n} \in \mathcal {L}^2\). Then construct the expectations \(\mathcal {E}_i^1(\Lambda _M)\) and \(\mathcal {E}_i^0(\Lambda _*\cup \Lambda _c)\) in Eqs. (15) and (16), respectively.

3.
Maximisation step: compute
$$\begin{aligned} {\hat{\Lambda }}_{M,i+1} = \arg \max _{\Lambda _M} \mathcal {E}_i^1(\Lambda _M) \end{aligned}$$and denote \(\Lambda _c = \{ \lambda ^c_1, \ldots , \lambda ^c_q \}\) and \({\hat{\Lambda }}_{c,i} = \{ {\hat{\lambda }}^c_{1,i}, \ldots , {\hat{\lambda }}^c_{q,i} \}\), where \(q \le L+k\) is the number of parameters in \(\Lambda _c\). For \(j = 1, \ldots , q\), compute
$$\begin{aligned} {\hat{\lambda }}^c_{j,i+1} \text {~~such that~~} \max _{\Lambda _*} \mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i+1}^{j+}) \ge \max _{\Lambda _*} \mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i+1}^{j-}) \end{aligned}$$(18)where
$$\begin{aligned} {\hat{\Lambda }}_{c,i+1}^{j+}:= & {} \left\{ {{\hat{\lambda }}}^c_{1,i+1}, \ldots , {{\hat{\lambda }}}^c_{j-1,i+1}, {{\hat{\lambda }}}^c_{j,i+1}, {{\hat{\lambda }}}^c_{j+1,i}, \ldots , {{\hat{\lambda }}}^c_{q,i} \right\} , \text {~~and~~}\\ {\hat{\Lambda }}_{c,i+1}^{j-}:= & {} \left\{ {{\hat{\lambda }}}^c_{1,i+1}, \ldots , {{\hat{\lambda }}}^c_{j-1,i+1}, {{\hat{\lambda }}}^c_{j,i}, {{\hat{\lambda }}}^c_{j+1,i}, \ldots , {{\hat{\lambda }}}^c_{q,i} \right\} , \end{aligned}$$to obtain \({\hat{\Lambda }}_{c,i+1} = \left\{ {{\hat{\lambda }}}^c_{1,i+1}, \ldots , {{\hat{\lambda }}}^c_{q,i+1} \right\}\). Finally, let
$$\begin{aligned} {\hat{\Lambda }}_{*,i+1} = \arg \max _{\Lambda _*} \mathcal {E}_i^0\left( \Lambda _* \cup {\hat{\Lambda }}_{c,i+1} \right) \end{aligned}$$and \({{\hat{\Lambda }}}_{i+1} = {{\hat{\Lambda }}}_{M,i+1}\cup {{\hat{\Lambda }}}_{*,i+1}\cup {{\hat{\Lambda }}}_{c,i+1}\).
Intuitively, at the \((i+1)\)th iteration of the algorithm we update the parameters in \(\Lambda _M\) exactly, then we update the parameters \(\lambda _1^c, \ldots , \lambda ^c_q \in \Lambda _c\) one at a time in (18), and finally we update the parameters in \(\Lambda _*\) exactly. The one-dimensional updates in (18) can be numerically carried out in various ways. We suggest the application of the accelerated random search (ARS) method of [3] (see also [29] and Appendix B), which is straightforward to apply and has performed well in our computer experiments (presented in Sect. 4).
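For concreteness, a bare-bones ARS-style maximiser for one box-constrained scalar parameter might look as follows. This is our own sketch of the shrink-and-restart idea; the contraction factor, radius bounds and iteration budget are illustrative choices, not values prescribed in [3]:

```python
import random

def ars_maximise(f, x0, lo, hi, r_min=1e-6, contract=2.0, iters=2000, seed=0):
    """Accelerated-random-search-style maximisation of f over [lo, hi].

    Draws a candidate uniformly within a search radius around the incumbent;
    on improvement the radius resets to its maximum, otherwise it contracts,
    restarting at full radius once it falls below r_min.
    """
    rng = random.Random(seed)
    r_max = hi - lo
    x, fx, r = x0, f(x0), r_max
    for _ in range(iters):
        # candidate drawn in the current radius, clipped to the constraint box
        y = min(hi, max(lo, x + rng.uniform(-r, r)))
        fy = f(y)
        if fy > fx:
            x, fx, r = y, fy, r_max    # accept and reset the search radius
        else:
            r /= contract              # shrink the radius on failure
            if r < r_min:
                r = r_max              # restart the search at full radius
    return x, fx
```

In the SAEM context, `f` would be the map \(\lambda^c_j \mapsto \max_{\Lambda_*} \mathcal{E}_i^0(\cdot)\) with all other parameters held fixed.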
The SAEM algorithm can be run for a fixed, prescribed number of iterations or stopped when some criterion is fulfilled, e.g. that the difference between successive estimates is smaller than a given threshold.
Remark 1
While bearing similarity, the SAEM scheme introduced in this paper does not strictly belong to the class of algorithms described in [14]. Nevertheless, it is not hard to prove that the estimates \({{\hat{\Lambda }}}_i\) generated by the SAEM scheme have non-decreasing likelihoods, i.e. \(p(\varvec{x}_{0:T}\mid {{\hat{\Lambda }}}_{i+1}) \ge p(\varvec{x}_{0:T}\mid {{\hat{\Lambda }}}_i)\) for every \(i \ge 1\). This is the same property enjoyed by the standard EM method and the generalised algorithms of [14].
3.4 Estimation of the number of layers L
When the number of dynamical layers, L, in the cDNARMS(L) model is unknown we can also use the proposed SAEM algorithm to estimate it. In particular, assume that \(L \in \{c^-, \ldots , c^+\}\), i.e. the number of layers L of the model is at least \(c^-\) and at most \(c^+\), so that its estimation can be restricted to a finite set of values. We can run I iterations of the SAEM algorithm for each \(L\in \{c^-, \ldots , c^+\}\) in order to obtain approximate likelihoods
where \({\hat{\Lambda }}_I\) is the set of parameter estimates after I iterations (note that the number of elements in this set increases with L). The likelihood \(p(\varvec{x}_{0:T}\mid {{\hat{\Lambda }}}_I)\) above can be computed exactly as a byproduct of the forward–backward algorithm run in the expectation step, with a computational cost \(\mathcal {O}(T)\) (see [15] for details).
Choosing the value of \(L\in \{c^-,\ldots ,c^+\}\) that maximises \(\ell _T(L)\) typically leads to overestimation due to overfitting. This is illustrated by the computer simulations of Sect. 4.2.1, where it is shown that increasing the number of layers beyond its true value can lead to an increase of the likelihood function \(\ell _T(L)\) (see Table 1).
Following [38], we adopt a penalised likelihood estimator of the number of layers. In particular, we have run computer experiments with a penalisation of the likelihood of the form \(e^{-C_T |\Lambda _L|}\), where \(C_T = \frac{1}{2}\log (T)\) and \(|\Lambda _L|\) denotes the number of parameters in the set \(\Lambda\) when the model has L layers. This yields the penalised likelihood \({\tilde{\ell }}_T(L):= e^{-C_T |\Lambda _L|} p(\varvec{x}_{0:T}\mid {{\hat{\Lambda }}}_I)\) and the penalised estimator
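In the log domain this rule simply amounts to maximising the log-likelihood minus \(C_T |\Lambda_L|\), a BIC-style criterion. A minimal sketch (function and argument names are ours, for illustration):

```python
import math

def select_num_layers(loglik, num_params, T):
    """Penalised-likelihood selection of the number of layers L.

    loglik     : dict mapping each candidate L to log p(x_{0:T} | hat Lambda_I)
    num_params : dict mapping each candidate L to the parameter count |Lambda_L|
    T          : length of the observed series (plus one sample at n = 0)
    Maximises loglik[L] - C_T * num_params[L] with C_T = 0.5 * log(T).
    """
    C_T = 0.5 * math.log(T)
    return max(loglik, key=lambda L: loglik[L] - C_T * num_params[L])
```

For example, with `T = 1000`, `loglik = {1: -500.0, 2: -480.0, 3: -478.0}` and `num_params = {1: 3, 2: 8, 3: 15}`, the penalty outweighs the marginal likelihood gain of the third layer and the rule selects `L = 2`.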
4 Computer simulations
4.1 El Niño–southern oscillation model
El Niño–Southern Oscillation (ENSO) is a recurring event belonging to a class of climatic phenomena called atmospheric oscillations. It originates from variations in wind intensities in the general atmospheric circulation. These variations cause the oscillation of the thermocline in the Pacific Ocean which, in turn, causes alternating high and low sea surface temperatures (SSTs) on both sides of the ocean. In particular, the El Niño phenomenon corresponds to an increase in SST in the eastern Pacific, which is associated with a strong increase in rainfall intensity and duration in Central and South America. The anomalies in the trade winds and the SST at the Pacific Ocean’s equator have been historically modelled as the solution of a DDE [37, 45, 46]. Several conceptual equations have also been proposed in the literature [18, 21] that incorporate stochastic terms or display chaotic dynamics.
For the computer experiments in this section we consider a nonlinear DDE based on the model of Ghil et al. [18], where we include a diffusion term to obtain the stochastic DDE in Itô form
where t denotes continuous time (the time unit is 1 year), \(\mathcal {T}(t)\) is the SST anomaly, \(\tau \in \mathbb {R}^+\) is a time delay, a, b, \(\kappa\) and \(\omega\) are constants, \(W(t)\) is a standard Wiener process and \(\sigma >0\) is a diffusion coefficient that determines the intensity of the stochastic perturbation.
Equation (19) can be integrated numerically using different schemes [7]. For simplicity, we apply an Euler–Maruyama scheme with a constant time step \(h>0\), which yields
where \(\mathcal {T}_n \approx \mathcal {T}(nh)\) is an approximation of the SST anomaly process at \(t=nh\), D is a discrete delay computed as \(D=\frac{\tau }{h}\) (we assume that \(\tau\) can be expressed as an integer multiple of h) and \(u_n\) is an i.i.d. sequence of \(\mathcal {N}(0,1)\) (standard Gaussian) r.v.’s. Note that the drift in Eq. (20) is periodic, hence \(\mathcal {T}_n\) has bounded second order moment (provided that h and \(\sigma\) are sufficiently small).
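As an illustration, the Euler–Maruyama recursion (20) can be sketched as below. Since Eqs. (19)–(20) are displayed as images above, the drift is assumed here to follow the Ghil et al. [18] conceptual form \(-a\tanh (\kappa \mathcal {T}(t-\tau )) + b\cos (2\pi \omega t)\); the function name is ours, and the \(\mathcal {N}(0,1)\) pre-history matches the initial condition used later in the paper.

```python
import numpy as np

def simulate_enso_em(a, b, kappa, omega, sigma, tau, h, T_steps, rng=None):
    """Euler-Maruyama integration of a delayed ENSO-type SDE on a regular grid:
    T_{n+1} = T_n + h*(-a*tanh(kappa*T_{n-D}) + b*cos(2*pi*omega*n*h)) + sigma*sqrt(h)*u_n,
    with integer delay D = tau/h (tau assumed a multiple of h)."""
    rng = np.random.default_rng() if rng is None else rng
    D = int(round(tau / h))
    prehist = rng.standard_normal(D + 1)  # x_n ~ N(0,1) for n <= 0
    x = np.empty(T_steps + 1)
    x[0] = prehist[-1]
    for n in range(T_steps):
        # delayed state: use the simulated path when available, else the pre-history
        x_del = x[n - D] if n - D >= 0 else prehist[n - D - 1]
        drift = -a * np.tanh(kappa * x_del) + b * np.cos(2.0 * np.pi * omega * n * h)
        x[n + 1] = x[n] + h * drift + sigma * np.sqrt(h) * rng.standard_normal()
    return x
```

Because the drift is bounded by \(a+b\), the simulated path remains well behaved for small h and \(\sigma\), in line with the bounded-moment remark above.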
Model (19) and its discretised version (20) can be shown to yield sequences of temperatures that are (qualitatively and quantitatively) similar to the measurements of SST anomalies in the South Pacific Ocean. However, these models are not accurate enough for reliable forecasting and, in particular, it has not been possible to use them to predict the large “spikes” in SST that characterise the El Niño phenomenon.
In an attempt at (a) extending the applicability of model (19) and (b) illustrating the proposed multilayer modelling approach, we address the construction of multilayer cDNARMS(L) models where each of the L layers corresponds to a difference equation of the form in (20) with its own set of static parameters \(\{ a[l],b[l],\kappa [l],\omega [l],\sigma [l]\}\), \(l \in \{1, \ldots , L\}\). Such a multilayer model also requires an \(L\times L\) transition matrix \(\varvec{M}\) that describes the Markov switching mechanism.
4.2 DNARMS(L) model with integer delays
Let us construct a DNARMS(L) model of the form in Eq. (3) for SST anomaly time series. Publicly available SST data for ENSO is given in the form of monthly averaged SSTs and SST anomalies. For this reason, we adopt \(h=\frac{1}{12}\) as a time step in the Euler scheme (20). This time step is known and common to all L layers of the DNARMS(L) model. Additionally, we define:

A Markov chain \(\{l_n\}_{n\ge 0}\) taking values in \(\mathcal {L}=\{1, \ldots , L\}\) with an \(L \times L\) transition matrix \(\varvec{M}\).

A set of delays \(\tau [1], \ldots , \tau [L]\), one per layer. We assume that these delays are integer multiples of \(h=\frac{1}{12}\); hence, they turn into integer delays \(D[1]=\frac{\tau [1]}{h}, \ldots , D[L]=\frac{\tau [L]}{h} \in \mathbb {N}\).

Parameters \(\{ a[l], b[l], \kappa [l], \omega [l], \sigma [l] \}\) for each \(l \in \mathcal {L}\); hence, the parameter vector \(\boldsymbol{\alpha } = \left( a[1], b[1], \kappa [1], \omega [1], \sigma [1], \ldots , a[L], b[L], \kappa [L], \omega [L], \sigma [L] \right) ^\top\) is \(5\,L \times 1\) (i.e. \(k=5\,L\) in the general model (3)).
The assumption \(D[l]=\frac{\tau [l]}{h} \in \mathbb {N}\) for every l may be a mild one when h can be chosen to be sufficiently small; however, it is hardly realistic for a time step of one month (and we drop it in Sect. 4.3).
The resulting DNARMS(L) model of the ENSO time series can be compactly written as
where \(x_n\) is the SST anomaly at time \(t=nh\) and \(\{u_n\}_{n\ge 0}\) is an i.i.d. \(\mathcal {N}(0,1)\) sequence. We assume \(x_n \sim \mathcal {N}(0,1)\) for all \(n \le 0\).
4.2.1 Estimation of the number of layers L
In the first set of computer experiments, we assess the estimation of the number of layers L using the penalised maximum likelihood method described in Sect. 3.4. We simulate two datasets \(x_{0:T}^{(2)}\) and \(x_{0:T}^{(3)}\) with true values \(L_o=2\) and \(L_o=3\), respectively, and \(T=1,000\). Then, for each \(L \in \{2,3,4\}\) we run the SAEM algorithm of Sect. 3.3. Since EM algorithms converge only locally [30], for each \(L_o\) and each L we run the SAEM scheme \(R=50\) times, with the same dataset but different, independently generated initial parameter estimates \({\hat{\Lambda }}^{(r)}_{0}, r = 1,\ldots ,50\). Recall that the SAEM algorithm yields the likelihood as a byproduct (see Sect. 3.4). We denote the likelihood of the model with L layers in the rth simulation as \(\ell _{T}^{(r)}(L)\) and then take the maximum over the R independent runs, i.e. \(\ell _T(L) = \max _{1\le r \le R} \ell _{T}^{(r)}(L)\).
The true transition matrices are
for \(L_o=2\) and \(L_o=3\), respectively. The remaining parameters are
for \(L_o=2\) and, for \(L_o=3\),
Table 1 shows the loglikelihoods and the penalised loglikelihoods obtained for the models with \(L = 2,3,4,\) when the true model generating the data has \(L_o=2\) layers (left) and \(L_o=3\) layers (right). It is seen that higher likelihoods (\(\ell _T(L)\)) are obtained as L is increased, even when \(L>L_o\). This is due to overfitting of the model parameters. When a simple penalisation is included (\(\tilde{\ell }_T(L)\)), the correct number of dynamical layers is detected in both experiments.
4.2.2 Parameter estimation
In this section, we study the accuracy of the SAEM parameter estimation algorithm for the model with \(L_o=2\) layers described in Sect. 4.2.1. We aim at estimating the transition matrix \(\varvec{M}\) as well as the integer delays \(D[1:2] = \left( D[1], D[2] \right) ^\top\) and the model parameters \(\boldsymbol{\alpha } = \left( a[1:2],b[1:2],\kappa [1:2], \omega [1:2], \sigma [1:2] \right) ^\top\). We assess the accuracy of the estimators of real parameters in terms of normalised errors. Assume we run R independent simulations, all with the same true parameters (but independently generated observations \(x_{0:T}\)). Then, the normalised estimation errors for the transition matrix \(\varvec{M}\) are
where \(\Vert \cdot \Vert _{F}\) denotes the Frobenius norm and \({\hat{\varvec{M}}}^{(r)}\) represents the estimate of \(\varvec{M}\) in the rth independent simulation. For the parameters in \(\boldsymbol{\alpha }\), the normalised errors are computed as
where \(\alpha _i\) denotes the ith entry of vector \(\boldsymbol{\alpha }\) and \({\hat{\alpha }}_i^{(r)}\) is its estimate in the rth independent simulation.
Since the delays D[1 : 2] are integers, calculating a normalised Euclidean norm does not provide a meaningful characterisation of performance. Instead, we assess the estimation algorithm by computing the frequency of correct detections, i.e. if \({\hat{D}}[l]^{(r)}\) is the estimate of the delay D[l] in the rth independent simulation, for \(r=1, \ldots , R\), then the frequency of (correct) detections is \(F_D:= \sum _{r=1}^R \delta \left[ \sum _{l=1}^{L} \left| {\hat{D}}[l]^{(r)} - D[l] \right| \right]\), where \(\delta [\cdot ]\) is the Kronecker delta function.
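A minimal numpy sketch of these performance metrics; the exact normalisations in the displayed equations are not reproduced above, so a relative Frobenius-norm error for \(\varvec{M}\) and a relative absolute error per parameter are assumed here.

```python
import numpy as np

def transition_matrix_error(M_true, M_hat):
    """Normalised Frobenius-norm error e_M(r) for an estimated transition matrix."""
    return np.linalg.norm(M_true - M_hat, 'fro') / np.linalg.norm(M_true, 'fro')

def parameter_error(alpha_i, alpha_hat_i):
    """Normalised error e_{alpha_i}(r) for the i-th real parameter."""
    return abs(alpha_i - alpha_hat_i) / abs(alpha_i)

def detection_frequency(D_true, D_hats):
    """F_D: number of runs in which every integer delay is detected exactly
    (Kronecker delta applied to the summed absolute estimation errors)."""
    D_true = np.asarray(D_true)
    return sum(int(np.sum(np.abs(np.asarray(D_hat) - D_true)) == 0) for D_hat in D_hats)
```

For example, with true delays `[3, 9]` and three runs returning `[3, 9]`, `[3, 8]` and `[3, 9]`, the detection frequency is 2.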
The computer experiment consists of the following steps:

i)
Generate \(R=100\) independent realisations of the time series model with \(L_o=2\) described in Sect. 4.2.1. Each signal consists of \(T=1,000\) data points. A sample signal is displayed in Fig. 1a.

ii)
For each independent realisation, extract four subsequences containing the first 250, 500, 750 and 1,000 data points, respectively.

iii)
Generate an initial condition for each realisation, \({\hat{\Lambda }}_0^{(r)}\), \(r=1,..., R\). Apply the SAEM algorithm to each subsequence of each realisation and obtain parameter estimates.

iv)
For each \(r=1,..., R\) and each subsequence, compute the normalised errors for the transition matrix (\(e_{\varvec{M}}(r)\)) and each real parameter (\(e_{\alpha _i}(r)\)), as well as the detection frequency \(F_D\) for the integer delays.
The purpose of this setup is to demonstrate the effectiveness of the SAEM algorithm, study the existence of local maxima of the likelihood and illustrate the improvement in the accuracy of the parameter estimates as the length of the observed series increases.
Figure 1b shows a bar diagram with the absolute frequency of correct detection, \(F_D\), for each data sequence length. Since there are \(R=100\) simulations (for each length), the maximum value of \(F_D\) is \(R=100\). We see that, as the length of the data sequence increases, the value of \(F_D\) improves as well. When the number of data points in the sequence is \(T=1,000\) we obtain \(F_D \approx 95\), i.e. D[1] and D[2] are both detected correctly in \(\approx 95\%\) of the simulations. The delays become mismatched in the simulation runs where the SAEM algorithm converges to a local maximum of the likelihood.
Figures 2 and 3 show box plots of the normalised errors for the remaining parameters. For each parameter and each length of the data sequence, the horizontal line in each box indicates the median of the errors, the box extends between the 0.25 and 0.75 quantiles of the empirical distribution, the whiskers extend to the remainder of the distribution excluding outliers, and the circles are outliers (i.e. points located above the upper quartile by more than 1.5 times the interquartile range). We observe how the median error decreases when more data points are available. As with the delays, outliers are due to simulations where the SAEM algorithm converges to a local maximum of the likelihood that differs significantly from its global maximum. These simulations indicate that, in any practical application, the SAEM algorithm should be run with multiple initialisations (even for a single dataset). One can then select the parameter estimates from the run that attained the highest log-likelihood (which is computed by the SAEM algorithm as a byproduct of the forward–backward procedure).
4.3 cDNARMS(L) model with noninteger delays
The assumption of integer delays, i.e. that \(\tau [l]\), \(l=1, \ldots , L\), are all integer multiples of the onemonth time step \(h=\frac{1}{12}\), is unrealistic. In this section, we assume that \(D[l] = \frac{\tau [l]}{h} \in (1,+\infty )\) and construct a cDNARMS(L) model of the form in Eq. (4) for the ENSO time series. To be specific, the SAEM algorithm is implemented with the model
where \({\tilde{x}}_{n-D[l_n]}\) is computed by linear interpolation as shown in Eq. (5). The transition matrix \(\varvec{M}\) and the parameters \(\boldsymbol{\alpha }\) are defined in the same way as in Sect. 4.2, and \(x_n \sim \mathcal {N}(0,1)\) for all \(n<0\).
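A sketch of the interpolation step, assuming Eq. (5) (not reproduced above) is standard linear interpolation between the two integer indices that bracket the delayed position \(n-D\); the function name is ours.

```python
import numpy as np

def delayed_value(x, n, D):
    """Linearly interpolate the sequence x at the (possibly non-integer) index n - D:
    x~_{n-D} = (1 - gamma) * x_{floor(n-D)} + gamma * x_{floor(n-D)+1},
    with gamma the fractional part of n - D. Reduces to x_{n-D} for integer D."""
    t = n - D
    i0 = int(np.floor(t))
    gamma = t - i0
    if gamma == 0.0:
        return x[i0]
    return (1.0 - gamma) * x[i0] + gamma * x[i0 + 1]
```

For instance, with `x = [0, 1, 2, 3, 4]`, a delay `D = 1.5` at time `n = 4` interpolates halfway between `x[2]` and `x[3]`.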
4.3.1 Generation of synthetic time series
The interpolated signal \({{\tilde{x}}}_{n-D[l]}\) in (22) is an approximation that we impose to incorporate a real delay into a discretetime model. In order to put this approximation to the test, we generate time series data using stochastic DDEs of the form in Eq. (19) that we integrate with an Euler–Maruyama scheme on a finer grid, namely, with a time step of the form \(h=\frac{1}{12\,m}\), where m is a positive integer.
In particular, we let again \(L_o=2\) and generate an auxiliary data sequence from the model
where the \(\mathcal {D}[l]\)’s are positive integer delays, \(\{u_i\}_{i\ge 0}\) is a standard Gaussian i.i.d. sequence of noise variables, \(y_i \sim \mathcal {N}(0,1)\) for all \(i<0\), \(l_i \sim P(l_i \mid l_{i-1})=M_{l_{i-1},l_i}\) when i is an integer multiple of m and \(l_i=l_{i-1}\) when i is not an integer multiple of m (i.e. the index \(l_i\) can only change every m time steps).
The actual dataset used for the computer simulations is then obtained by subsampling \(\{y_i\}_{i\ge 0}\) by a factor m, namely,
If \(m=2\), the delays \(\mathcal {D}[l]\), which are integers in the discretetime scale of \(\{y_i\}_{i\ge 0}\), become rational in the discretetime scale of \(\{x_n\}_{n\ge 0}\), with possible values of the form \(D[l] \in \left\{ r, r \pm \frac{1}{2} \right\}\), where \(r \in \mathbb {Z}^+\). For general \(m \in \mathbb {Z}^+\), the delays in the subsampled time scale of the series \(\{ x_n \}_{n\ge 0}\) are of the form \(D[l] \in \left\{ r, r + \frac{1}{m}, \ldots , r+\frac{m-1}{m} \right\}\), with \(r\in \mathbb {Z}^+\). In this way, we generate a data sequence \(\{x_n\}_{n\ge 0}\) that depends on noninteger delays and does not rely on interpolation or any other approximation based on the observed data.
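The subsampling step and the induced delay conversion can be sketched as follows; the switching recursion that generates \(\{y_i\}\) itself is omitted, and the function name is ours.

```python
import numpy as np

def subsample(y, m):
    """Subsample the fine-grid sequence {y_i} by a factor m: x_n = y_{n*m}."""
    return np.asarray(y)[::m]

def coarse_delay(D_fine, m):
    """An integer delay D_fine on the fine grid becomes D_fine / m on the coarse
    grid, which is non-integer whenever D_fine is not a multiple of m."""
    return D_fine / m

# e.g. a fine-grid delay of 7 steps with m = 2 yields D = 3.5 on the coarse grid,
# matching the true non-integer delays used in Sect. 4.3.2
```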
4.3.2 Parameter estimation
We conduct a set of computer experiments similar to those in Sect. 4.2.2 but using data sequences generated by the procedure in Sect. 4.3.1 to account for noninteger delays. The estimation errors for \(\varvec{M}\) and the parameters in vector \(\boldsymbol{\alpha }\) are computed in the same way as in Sect. 4.2.2. The estimates of the delays for these experiments are real numbers, hence we compute normalised errors of the form \(e_{D[l]}(r) = \left| D[l]-{\hat{D}}[l]^{(r)} \right| /D[l]\), where D[l] is the true value of the delay and \({\hat{D}}[l]^{(r)}\) is the estimate in the rth simulation run.
The procedure for the computer experiments is the same as in Sect. 4.2.2, with \(R=100\) independently generated data sequences of length \(T=1,000\), each of them split to obtain subsequences of length 250, 500, 750 and 1,000. The values of the true parameters \(\boldsymbol{\alpha }\) and transition matrix \(\varvec{M}\) used to generate the data are the same as in the model with \(L_o=2\) in Sect. 4.2.1. The auxiliary sequence \(\{y_i\}_{i\ge 0}\) is generated with a time step \(\frac{1}{12m}\), with \(m=2\). The true noninteger delays are \(D[1]=3.5\) and \(D[2]=9.5\).
Figures 4 and 5 show the box plots of the normalised errors for all parameters. Note that the errors are grouped per parameter across the two layers (as in Figs. 2 and 3). We observe that the median errors decrease consistently as the length of the data sequence increases, except for the real delays D[1:2]. In this case the normalised errors are small (close to 10 ) and their variance reduces with increasing data length, but the median error remains approximately constant. This numerical result indicates that the estimators of D[1] and D[2] present a bias due to the mismatch between the model used by the SAEM algorithm and the model used to generate the data.
5 Experimental results
5.1 Data and models
After validating the performance of the SAEM inference algorithm with synthetic data in Sect. 4, we now tackle the fitting of cDNARMS(L) models using real ENSO data. The dataset consists of four time series of monthly SST anomalies, each corresponding to a different zone of the Pacific Ocean and each consisting of \(T=1,848\) monthly observations, starting in January 1870 and up to December 2023. The series are labelled ENSO 1+2, ENSO 3, ENSO 3+4 and ENSO 4. Figure 6 shows the evolution of the SST anomalies in the 12-year period between January 2012 and December 2023. The ‘peaks’ in December 2015 and December 2023 are the most recent El Niño events.
We have applied the SAEM algorithm to each one of these datasets in order to fit different models:

cDNARMS(L) models with \(L=2, 3, 4\), and

a Markov switching model with \(L=4\) layers where each layer is a standard linear AR(4) system (as described, e.g. in [10, Chapter 4]), labelled ARMS(4,4) in the figures and tables of this section.
The specific form of the Markov switching system based on linear AR processes is briefly described in Appendix C. All four models are fitted using the SAEM algorithm. For the cDNARMS(L) models the algorithm is applied in the same way as in Sect. 4.3. For the models with linear AR(4) layers, the algorithm simplifies considerably as \(\Lambda _c = \emptyset\), which implies that there is no need for the ARS optimisation scheme. The overall estimation procedure becomes very similar to the one in [15] (except that the models in [15] are of order 1). Note that the models are fitted separately to each dataset.
In addition to the Markov switching models, we have also used the ENSO datasets to train four deep learningbased schemes for time series forecasting. In particular,

a gated recurrent unit (GRU) neural network (NN) [23],

a multilayer perceptron (MLP) NN [33],

long shortterm memory (LSTM) NN [6], and

the DeepAR model [42].
We have trained each model with a learning rate of 0.001, 50 epochs and a time window of the 12 previous months. Otherwise, we have selected the combinations of layers and neurons that yielded the best numerical results in our computer experiments. All models are implemented in Python. The GRU, MLP, and LSTM NNs are built using the Keras library (https://keras.io), while the DeepAR model is implemented using the GluonTS library (https://ts.gluon.ai/stable/index.html).
We have used the data spanning from January 1870 to December 1991 (\(T_{train} = 1,464\) observations) in order to train the deep learning models and fit the Markov switching schemes. The remaining data, from January 1992 to December 2023 (\(T_{test} = 384\) observations), have served as the test set to assess the prediction performance.
5.2 Autocorrelation function
Fig. 7 shows the empirical autocorrelation functions computed using the four ENSO datasets and synthetic data generated with the cDNARMS(4), ARMS(4,4) and DeepAR models fitted to each one of the four ENSO zones. Note that the GRU, MLP and LSTM models are nongenerative (they perform a deterministic transformation of their inputs), hence they are not displayed in this comparison.
The three generative models perform relatively well for autocorrelation lags up to 6–7 months in zones 1+2, 3 and 3+4. For ENSO 4, cDNARMS(4) and DeepAR underestimate the autocorrelation, while ARMS(4,4) overestimates it. After 6–7 months the discrepancies are larger, with cDNARMS(4) and ARMS(4,4) typically overestimating the autocorrelation. These curves are averages of the empirical autocorrelations of 2,000 independently generated series for each model.
5.3 Forecasting
The forecasting task involves the prediction of the SST anomalies with a lead time of k months. For a given lead time k and a given algorithm, we evaluate the prediction rootmean square error (RMSE) and the Pearson correlation coefficient (PCC), given by
and
where \(x_{1:T_{test}}\) is the signal from the first to the last month of the test period, \(\hat{x}_n^k\) is the forecast with a lead time of k months, \(\bar{x}\) is the mean of \(x_n\) and \(\bar{\hat{x}}^k\) is the mean of \(\hat{x}_n^k\). Better performance is achieved for smaller RMSE\(_k\) and larger PCC\(_k\). The predictions for cDNARMS(4) and ARMS(4,4) are computed using a standard particle filter (whose state is the Markov chain \(l_n \sim \varvec{M}\)) [12] with \(N=500\) particles, while for DeepAR the prediction \({\hat{x}}_n\) is the mean over \(N=500\) generated sequences.
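Assuming the standard definitions of the RMSE and the Pearson correlation coefficient (the displayed equations are not reproduced above), these metrics can be computed as:

```python
import numpy as np

def rmse(x, x_hat):
    """Root-mean-square error between the test signal and its k-step forecast."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.sqrt(np.mean((x - x_hat) ** 2))

def pcc(x, x_hat):
    """Pearson correlation coefficient between the test signal and its forecast."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    xc, hc = x - x.mean(), x_hat - x_hat.mean()
    return np.sum(xc * hc) / np.sqrt(np.sum(xc ** 2) * np.sum(hc ** 2))
```

A perfect forecast attains RMSE\(_k = 0\) and PCC\(_k = 1\), consistent with "smaller RMSE, larger PCC" being better.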
Tables 2, 3, 4 and 5 display the RMSEs and PCCs of each model for lead times \(k=1, 3, 6\) and 9 months for ENSO 1+2, ENSO 3, ENSO 3+4 and ENSO 4, respectively. For lead times of \(k=1\) and \(k=3\) months, cDNARMS(4) achieves the best or second-best performance across all zones, both in RMSE and PCC. Its relative performance deteriorates for larger lead times (6 and 9 months). While both its RMSE and PCC values remain close to those of the other methods, GRU is the best-performing method for ENSO 1+2, DeepAR attains the best performance for ENSO 3 and ENSO 3+4, and LSTM and DeepAR are best for ENSO 4.
Finally, Fig. 8 provides a graphical illustration of the forecasts for ENSO 3+4 from January 2012 to December 2023, with lead times of 1, 3, 6 and 9 months in Fig. 8a–d, respectively. The figures show the true SST, the mean forecast and the \(\pm 3\sigma\) interval around the mean forecast. For short lead times the predictions are accurate, with relatively small uncertainty, and capture well the December 2015 and December 2023 El Niño events. For the 6-month and 9-month forecasts the uncertainty is larger. The December 2023 El Niño peak is still within the shaded area, but the actual December 2015 peak occurs earlier than predicted.
6 Conclusions
We have introduced a class of nonlinear autoregressive Markov switching time series models where each dynamical layer (or submodel) may have a different, possibly noninteger, delay. This class includes a broad collection of systems that result from the discretisation of stochastic DDEs whose characteristic delays are not known a priori. Such models are common in geophysics.
The proposed family of models does not necessarily admit an asymptotic-regime analysis similar to the classical results of [47] for first-order Markov switching systems. Instead, we have identified a stability property of the distinct dynamic regimes in the switching model that guarantees that the signals generated by the proposed model have bounded moments up to a given order. We have also introduced numerical methods, based on a space-alternating EM procedure, to detect the number of dynamical layers in the model and to compute ML estimators of any unknown parameters, including the multiple, possibly noninteger, delays. The performance of these inference methods has been tested on nonlinear autoregressive Markov switching models that combine up to four dynamical layers, each of them originating from a DDE typically used to represent anomalies in the sea surface temperature of (certain regions of) the Pacific Ocean due to the ENSO phenomenon. Real-world ENSO data are recorded as monthly series, yet the delays that characterise the phenomenon are not known a priori and there is no physical reason for them to be integer multiples of a month. Our computer simulations show that it is possible to detect the number of layers and estimate the parameters of the models using relatively short series of observations. The cDNARMS(L) models fitted using real ENSO data can also be used to forecast strong positive anomalies in the sea surface temperature (the El Niño phenomenon). We have compared the predictive capability of the proposed cDNARMS(L) scheme with several deep learning-based time series forecasting models.
Availability of data and materials
The datasets analysed during the current study are available in the Climate Data Guide repository of the US National Center for Atmospheric Research (NCAR). URL: https://climatedataguide.ucar.edu/climatedata.
Notes
We denote \({\hat{\Lambda }}_{i+1} = {\hat{\Lambda }}_{M,i+1} \cup {\hat{\Lambda }}_{D,i+1} \cup {\hat{\Lambda }}_{\alpha ,i+1}\).
References
P. Ailliot, V. Monbet, Markovswitching autoregressive models for wind time series. Environ. Modell. Softw. 30, 92–101 (2012)
A. Aknouche, C. Francq, Stationarity and ergodicity of Markov switching positive conditional mean models. J. Time Ser. Anal. 43(3), 436–459 (2022)
M.J. Appel, R. Labarre, D. Radulovic, On accelerated random search. SIAM J. Optim. 14(3), 708–730 (2003)
A. Bellen, M. Zennaro, Numerical Methods for Delay Differential Equations (Oxford University Press, Oxford, 2013)
A. Bibi, A. Ghezal, On the Markov-switching bilinear processes: stationarity, higher-order moments and β-mixing. Stoch. Int. J. Probab. Stoch. Process. 87(6), 919–945 (2015)
C. Broni-Bedaiko, F.A. Katsriku, T. Unemi, M. Atsumi, J.D. Abdulai, N. Shinomiya, E. Owusu, El Niño-southern oscillation forecasting using complex networks analysis of LSTM neural networks. Artif. Life Robot. 24, 445–451 (2019)
E. Buckwar, Introduction to the numerical analysis of stochastic delay differential equations. J. Comput. Appl. Math. 125(1–2), 297–307 (2000)
R. Casarin, D. Sartore, M. Tronzano, A Bayesian Markovswitching correlation model for contagion analysis on exchange rate markets. J. Bus. Econ. Stat. 36(1), 101–114 (2018)
M. Cavicchioli, Determining the number of regimes in Markov switching VAR and VMA models. J. Time Ser. Anal. 35(2), 173–186 (2014)
J.D. Cryer, K.S. Chan, Time series analysis: with applications in R, vol. 2 (Springer, Berlin, 2008)
A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B 39(1), 1–38 (1977)
P.M. Djurić, J.H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M.F. Bugallo, J. Míguez, Particle filtering. IEEE Signal Process. Mag. 20(5), 19–38 (2003)
Fernandez, M.F.: Modelling Volatility with Markov-Switching GARCH Models. PhD thesis, The University of Liverpool, United Kingdom (2022)
J.A. Fessler, A.O. Hero, Space-alternating generalized expectation-maximization algorithm. IEEE Trans. Signal Process. 42(10), 2664–2677 (1994)
Franke, J.: Markov switching time series models. In T. Subba Rao, S. Subba Rao, and C.R. Rao, editors, Time Series Analysis: Methods and Applications, pp. 99–122. Elsevier (2012)
C. Fritsche, A. Klein, F. Gustafsson, Bayesian CramerRao bound for mobile terminal tracking in mixed LOS/NLOS environments. IEEE Wirel. Commun. Lett. 2(3), 335–338 (2013)
X. Fu, Y. Jia, J. Du, F. Yu, New interacting multiple model algorithms for the tracking of the manoeuvring target. IET Control Theory Appl. 4(10), 2184–2194 (2010)
M. Ghil, I. Zaliapin, S. Thompson, A delay differential model of ENSO variability: parametric instability and the distribution of extremes. Nonlinear Process. Geophys. 15(3), 417–433 (2008)
P. Guérin, M. Marcellino, Markov-switching MIDAS models. J. Bus. Econ. Stat. 31(1), 45–56 (2013)
S. Höcht, K.H. Ng, J. Wiesent, R. Zagst, Fit for leveragemodelling of hedge fund returns in view of risk management purposes. Int. J. Contemp. Math. Sci. 4(19), 895–916 (2009)
F.F. Jin, L. Lin, A. Timmermann, J. Zhao, Ensemblemean dynamics of the ENSO recharge oscillator under statedependent stochastic forcing. Geophys. Res. Lett. (2007). https://doi.org/10.1029/2006GL027372
A. Keane, B. Krauskopf, C.M. Postlethwaite, Climate models with delay differential equations. Chaos Interdiscip. J. Nonlinear Sci. 27(11), 114309 (2017)
J. Kim, M. Kwon, S.D. Kim, J.S. Kug, J.G. Ryu, J. Kim, Spatiotemporal neural network with attention mechanism for El Niño forecasts. Sci. Rep. 12(1), 7204 (2022)
L. Lacasa, I.P. Mariño, J. Miguez, V. Nicosia, É. Roldán, A. Lisica, S.W. Grill, J. Gómez-Gardeñes, Multiplex decomposition of non-Markovian dynamics and the hidden layer reconstruction problem. Phys. Rev. X 8(3), 031038 (2018)
R. Le Goff Latimier, E. Le Bouedec, V. Monbet, Markov switching autoregressive modeling of wind power forecast errors. Electric Power Syst. Res. 189, 106641 (2020)
B.G. Leroux, Maximum-likelihood estimation for hidden Markov models. Stoch. Process. Their Appl. 40(1), 127–143 (1992)
R.J. MacKay, Estimating the order of a hidden Markov model. Can. J. Stat. 30(4), 573–589 (2002)
Magnant, C., Giremus, A., Grivel, E., Ratton, L., Joseph, B.: Dirichlet-process-mixture-based Bayesian nonparametric method for Markov switching process estimation. In 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 1969–1973. IEEE (2015)
I.P. Mariño, J. Míguez, Monte Carlo method for multiparameter estimation in coupled chaotic systems. Phys. Rev. E 76(5), 057203 (2007)
G.J. McLachlan, D. Peel, Finite Mixture Models (John Wiley & Sons, New York, 2000)
J.G. McLachlan, T. Krishnan, The EM algorithm and extensions (John Wiley & Sons, New York, 2007)
V. Monbet, P. Ailliot, Sparse vector Markov switching autoregressive models. Application to multivariate time series of temperature. Comput. Stat. Data Anal. 108, 40–51 (2017)
G.Y. Muluye, P. Coulibaly, Seasonal reservoir inflow forecasting with lowfrequency climatic indices: a comparison of datadriven methods. Hydrol. Sci. J. 52(3), 508–522 (2007)
O. Cappé, E. Moulines, T. Rydén, Inference in Hidden Markov Models (Springer-Verlag, New York, 2005)
B. Øksendal, Stochastic differential equations, 6th edn. (Springer, Cham, 2007)
S.W. Phoong, S.Y. Phoong, S.L. Khek, Systematic literature review with bibliometric analysis on Markov switching model: Methods and applications. SAGE Open 12(2), 21582440221093064 (2022)
J. Picaut, F. Masia, Y. du Penhoat, An advective-reflective conceptual model for the oscillatory nature of the ENSO. Science 277(5326), 663–666 (1997)
Z. Psaradakis, N. Spagnolo, On the determination of the number of regimes in Markov-switching autoregressive models. J. Time Ser. Anal. 24(2), 237–252 (2003)
Pulford, G.W.: A survey of manoeuvring target tracking methods. arXiv:1503.07828, (2015)
C.P. Robert, G. Casella, Monte Carlo Statistical Methods (Springer, Cham, 2004)
T. Rydén, Estimating the order of hidden Markov models. Stat. J. Theor. Appl. Stat. 26(4), 345–354 (1995)
D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 36(3), 1181–1191 (2020)
C.A. Sims, D.F. Waggoner, T. Zha, Methods for inference in large multiple-equation Markov-switching models. J. Econ. 146(2), 255–274 (2008)
R. Stelzer, On Markov-switching ARMA processes: stationarity, existence of moments, and geometric ergodicity. Economet. Theor. 25(1), 43–62 (2009)
M.J. Suarez, P.S. Schopf, A delayed action oscillator for ENSO. J. Atmos. Sci. 45(21), 3283–3287 (1988)
C. Wang, A review of ENSO theories. Natl. Sci. Rev. 5(6), 813–825 (2018)
J.F. Yao, J.G. Attali, On stability of nonlinear AR processes with Markov switching. Adv. Appl. Probab. 32(2), 394–407 (2000)
J. Zhang, R.A. Stine, Autocovariance structure of Markov regime switching models and model selection. J. Time Ser. Anal. 22(1), 107–124 (2001)
Zheng, F.: Learning and smoothing in switching Markov models with copulas. PhD thesis, Ecole Centrale de Lyon, (2017)
Funding
This work has been partially supported by the Office of Naval Research (award N00014-22-1-2647) and Spain’s Agencia Estatal de Investigación (ref. PID2021-125159NB-I00 TYCHE) funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”.
Author information
Authors and Affiliations
Contributions
José A. MartínezOrdóñez has contributed to the design of the work, the analysis of data, the creation of new software, and the draft and revision of the manuscript. Javier LópezSantiago has contributed to the acquisition, analysis and interpretation of data, and the revision of the manuscript. Joaquín Míguez has contributed to the conception and design of the work, the analysis of data, and the draft and revision of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no conflicts of interest related to this research.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
1.1 Proof of Proposition 1
Let \(C_0^q:= \mathbb {E}_{\varvec{x}_0}\left[ \Vert \varvec{x}_0 \Vert ^q \right] < \infty\) and let
Since the random processes \(\{ \varvec{x}_n^{(l)} \}_{n\ge 0}\) are qstable, the constants \(c_\infty ^{l,q}\) are finite. Also, since \(L<\infty\), it follows that \(C_\infty ^q:= \max _{l\in \{1,...,L\}} c_\infty ^{l,q}<\infty\) (because it is the maximum of a finite number of finite constants).
From the Markov chain \(\{l_n\}_{n \ge 0}\) we construct a discrete, bivariate random sequence \(\{s_k,r_k\}_{k\ge 1}\) in the following way:

\((s_1,r_1)=(v,m)\in \{1,..., L\} \times \mathbb {N}\) if, and only if, \(l_{1:m}=v\) and \(l_{m+1} \ne v\);

\((s_k,r_k)=(v,m) \in \{1,..., L\} \times \mathbb {N}\) if, and only if, \(l_n=v\) for \(n=\sum _{i=1}^{k-1} r_i + 1, \ldots , \sum _{i=1}^k r_i\) and \(l_n \ne v\) for \(n = 1+\sum _{i=1}^k r_i\).
Intuitively, \(\{s_k,r_k\}_{k \ge 1}\) describes how the sequence \(\{l_n\}_{n \ge 0}\) can be split in subsequences of equal consecutive layers (leaving the initial layer, \(l_0\), aside). For example, if \(L=3\) and \(l_{0:7} = \{2, 1, 1, 1, 3, 3, 2, 2\}\) then \((s_1,r_1)=(1,3)\), \((s_2,r_2)=(3,2)\) and \((s_3,r_3)=(2,2)\). The random sequence \(\{s_k,r_k\}_{k\ge 1}\) is measurable w.r.t. the \(\sigma\)algebra generated by the Markov chain \(\{l_n\}_{n\ge 0}\). In particular, if we choose a specific realisation of the Markov chain, and denote it \(\{l_n^*\}_{n\ge 0}\), then we can determine the corresponding realisation of \(\{s_k,r_k\}_{k\ge 1}\), which we denote \(\{s_k^*,r_k^*\}_{k\ge 1}\).
We now use the fixed (but otherwise arbitrary) sequence \(\{s_k^*,r_k^*\}_{k\ge 1}\) to prove, by induction in the index k, that \(\sup _{n \ge 0} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q  l_{0:\infty }^* \right] \le C_\infty ^q \vee C_0^q\), where \(\mathbb {E}_{\varvec{x}_n}[\cdot  l^*_{0:\infty }]\) denotes expectation w.r.t. \(\varvec{x}_n\) conditional on the realisation of the Markov chain \(\{l_m=l_m^*\}_{m\ge 0}\).
For \(k=1\), we have \(l^*_n=s_1^*\) for \(n=1, \ldots , r_1^*\), hence the r.v.’s \(\varvec{x}_{1:r_1^*}\) are generated with the dynamics of layer \(s_1^*\). If we construct a patched version \(\{\bar{\varvec{x}}_n^{(s_1^*)}\}_{n\ge 0}\) of \(\{\varvec{x}_n^{(s_1^*)}\}_{n\ge 0}\), with the patch consisting of the initial state (i.e. \(n_0=0\) and \(\bar{\varvec{x}}_0^{(s_1^*)} = \varvec{x}_0\) in distribution), then it is apparent that \(\varvec{x}_{0:r_1^*} = \bar{\varvec{x}}_{0:r_1^*}^{(s_1^*)}\) in distribution. As a consequence, it follows that
Moreover, the sequence \(\{\varvec{x}_n^{(s_1^*)}\}_{n\ge 0}\) is \(q\)-stable by assumption, which implies that
where we have used the fact that \(c_\infty ^{l,q} \le C_\infty ^q\) for every \(l=1,..., L\) and \(n_0=0\). Substituting (A2) into (A1) yields
and completes the base case. Note that the constants \(C_\infty ^q\) and \(C_0^q\) are independent of the choice of \(l^*_{0:\infty }\).
For the induction step, assume that
By construction, \(l_n^*=s_k^*\) for \(n=1+\sum _{i=1}^{k-1} r_i^*, \ldots , \sum _{i=1}^k r_i^*\). If we choose the patch \(\bar{\varvec{x}}_n^{(s_k^*)} = \varvec{x}_n\) (with equality in distribution) for \(n=0, \ldots , \sum _{i=1}^{k-1} r_i^*\) and let \(\bar{\varvec{x}}_n^{(s_k^*)}\) be generated by the \(s_k^*\)th layer for \(n > \sum _{i=1}^{k-1} r_i^*\) then the sequence \(\{ \bar{\varvec{x}}_n^{(s_k^*)} \}_{n\ge 0}\) is a patched version of \(\{ \varvec{x}_n^{(s_k^*)} \}_{n\ge 0}\), with \(n_0 = \sum _{i=1}^{k-1} r_i^*\), that satisfies the inequality
because \(\varvec{x}_n = \bar{\varvec{x}}_n^{(s_k^*)}\), in distribution, for \(n = 0, \ldots , \sum _{i=1}^k r_i^*\). Moreover, since \(\{\varvec{x}_n^{(s_k^*)}\}_{n\ge 0}\) is \(q\)-stable, its patched version satisfies the inequality
where \(n_0 = \sum _{i=1}^{k-1} r_i^*\). Again, \(\varvec{x}_n = \bar{\varvec{x}}_n^{(s_k^*)}\), in distribution, for \(n = 0, \ldots , \sum _{i=1}^{k} r_i^*\), hence we can substitute the induction hypothesis (3) into (5) to obtain
where the last inequality follows trivially because \(c_\infty ^{l,q} \le C_\infty ^q\) for every \(l\in \{1, \ldots , L\}\). Substituting (6) back into (4) yields
and completes the induction step. Therefore, (7) holds for every k and it follows that
Next, we note that
and, since the constants \(C_\infty ^q\) and \(C_0^q\) in (8) hold for arbitrary \(l^*_{0:\infty }\), it follows that \(\mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q \right] \le C_\infty ^q \vee C_0^q\) for every n, i.e.
It remains to show that the patched versions of \(\varvec{x}_n\) also have bounded moments of order q. To that end, simply choose an arbitrary patch \(\bar{\varvec{x}}_{0:m}\) and let \(\{\bar{\varvec{x}}_n\}_{n>m}\) be generated by the cDNARMS(L) model, choose an arbitrary sequence \(l_{m:\infty }^*\) and construct the sequence \(\{s_k^*,r_k^*\}_{k\ge 1}\) that starts with \(s_1^*=l_{m+1}^*\). Then, the same induction argument as above, starting with the base case at \(n=m+1\), shows that \(\sup _{n\ge 0} \mathbb {E}\left[ \Vert \bar{\varvec{x}}_n \Vert ^q \right] \le C_\infty ^q \vee C_m^q\). \(\square\)
Appendix B
1.1 Accelerated random search algorithm
Let \(f: \mathcal {R}\subseteq \mathbb {R}^d \rightarrow [0,\infty )\) be a nonnegative objective function and consider the optimisation problem \(\hat{\boldsymbol{\beta }} \in \arg \max _{\boldsymbol{\beta } \in \mathcal {R}} f(\boldsymbol{\beta })\). The accelerated random search algorithm of [3] (see also [29] for some extensions) is an iterative method for global optimisation that performs a Monte Carlo search on a sequence of balls of varying radius. The algorithm can be outlined as follows:

Initialisation: choose a minimum radius \(r_{\min }>0\), a maximum radius \(r_{\max }>r_{\min }\), a contraction factor \(c>1\) and an (arbitrary) initial solution \(\boldsymbol{\beta }_0\). Set \(r_1 = r_{\max }\).

Iteration: denote the solution at the \((n-1)\)th iteration as \(\boldsymbol{\beta }_{n-1}\) and let \(B_n:= \left\{ \boldsymbol{\beta } \in \mathcal {R}: \Vert \boldsymbol{\beta } - \boldsymbol{\beta }_{n-1} \Vert < r_n \right\}\). To compute a new solution \(\boldsymbol{\beta }_n\), take the following steps:

1.
Draw \(\tilde{\boldsymbol{\beta }}\) from the uniform probability distribution on \(B_n\).

2.
If \(f(\tilde{\boldsymbol{\beta }}) > f(\boldsymbol{\beta }_{n-1})\) then set \(\boldsymbol{\beta }_n = \tilde{\boldsymbol{\beta }}\) and \(r_{n+1} = r_{\max }\). Otherwise, set \(\boldsymbol{\beta }_n = \boldsymbol{\beta }_{n-1}\) and \(r_{n+1} = \frac{r_{n}}{c}\).

3.
If \(r_{n+1}<r_{\min }\), then set \(r_{n+1}=r_{\max }\).

The algorithm can be iterated a fixed number of times or stopped when \(\boldsymbol{\beta }_n = \boldsymbol{\beta }_{n1} = \ldots = \boldsymbol{\beta }_{nr}\) for a prescribed, sufficiently large \(r>0\).
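The procedure above can be sketched compactly in Python. This is a minimal illustration under the simplifying assumption that the search region \(\mathcal {R}\) is all of \(\mathbb {R}^d\) (so no projection onto \(\mathcal {R}\) is needed); the function name `ars_maximise` and the uniform-in-ball sampling routine are ours:

```python
import math
import random

def ars_maximise(f, beta0, r_min, r_max, c, n_iter, rng=random):
    """Accelerated random search (sketch): maximise f over R^d by drawing
    candidates uniformly from a ball whose radius contracts by a factor c
    after every failed move and resets to r_max after every improvement
    (or whenever it falls below r_min)."""
    beta = list(beta0)
    best = f(beta)
    r = r_max
    d = len(beta)
    for _ in range(n_iter):
        # Uniform draw in the ball B(beta, r): uniform direction on the
        # sphere, radius with density proportional to rho^(d-1).
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        rho = r * rng.random() ** (1.0 / d)
        cand = [b + rho * gi / norm for b, gi in zip(beta, g)]
        fc = f(cand)
        if fc > best:
            beta, best = cand, fc
            r = r_max          # step 2: reset radius after an improvement
        else:
            r = r / c          # step 2: contract the search ball
            if r < r_min:
                r = r_max      # step 3: restart at the maximum radius
    return beta, best
```

Resetting the radius to \(r_{\max }\) after every improvement is what makes the search "accelerated": the contraction phase refines the incumbent locally, while the periodic resets keep probing the whole region and allow escapes from local maxima.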
Appendix C
1.1 Markov switching models with linear AR(K) layers
In Sect. 5 we compare the numerical performance of the proposed cDNARMS(L) model with Markov switching models, also with L layers, where each layer is a linear AR(K) process. These models can be explicitly described as
where \(\{l_n\}_{n\ge 0}\) is a Markov chain taking values in the set \(\{1, \ldots , L\}\) and characterised by the \(L \times L\) transition matrix \(\varvec{M}\), \(\{\varvec{z}_n\}_{n\ge 0}\) is a sequence of i.i.d. standard Gaussian r.v.’s and \(\{ a_1[l], \ldots , a_K[l], \sigma [l]: l=1, \ldots , L \}\) is the set of \(L(K+1)\) model parameters (i.e. \(K+1\) parameters per layer).
In our computer simulations we fit these parameters using the SAEM algorithm of Sect. 3.3, with the peculiarities that (a) there are no unknown delays and (b) the set \(\Lambda _c\) is empty, i.e. the update of the parameters \(\{ a_1[l], \ldots , a_K[l], \sigma [l]: l=1, \ldots , L \}\) at each maximisation step can be done exactly.
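For illustration, synthetic data from such a Markov switching AR(K) model can be generated with a short Python sketch. The function name, the uniform choice of initial layer and the zero initial condition are our choices for the example, not prescribed by the paper, and the SAEM fitting step is not shown:

```python
import random

def simulate_ms_ar(M, a, sigma, N, rng=random):
    """Simulate a Markov switching model with L linear AR(K) layers:
        x_n = a_1[l_n] x_{n-1} + ... + a_K[l_n] x_{n-K} + sigma[l_n] z_n,
    where z_n are i.i.d. standard Gaussian r.v.'s.

    M     : L x L transition matrix, M[i][j] = Prob(l_n = j | l_{n-1} = i)
    a     : list of L coefficient vectors, each of length K
    sigma : list of L noise standard deviations
    """
    L, K = len(M), len(a[0])
    l = rng.randrange(L)      # initial layer drawn uniformly (a choice)
    x = [0.0] * K             # zero initial condition x_{0:K-1} (a choice)
    layers = []
    for _ in range(N):
        # draw the next layer from row l of the transition matrix
        u, acc = rng.random(), 0.0
        for j in range(L):
            acc += M[l][j]
            if u <= acc:
                l = j
                break
        layers.append(l)
        # one step of the AR(K) recursion in the active layer
        mean = sum(a[l][k] * x[-(k + 1)] for k in range(K))
        x.append(mean + sigma[l] * rng.gauss(0.0, 1.0))
    return x, layers

# e.g. two layers with AR(2) dynamics and different noise scales:
# x, layers = simulate_ms_ar([[0.9, 0.1], [0.2, 0.8]],
#                            [[0.5, 0.2], [-0.3, 0.1]], [0.1, 0.5], 500)
```

Given a realisation produced this way, the \(L(K+1)\) parameters (and the transition matrix) are what the SAEM algorithm of Sect. 3.3 estimates, with the maximisation step available in closed form.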
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Martínez-Ordoñez, J.A., López-Santiago, J. & Miguez, J. Maximum likelihood inference for a class of discrete-time Markov switching time series models with multiple delays. EURASIP J. Adv. Signal Process. 2024, 74 (2024). https://doi.org/10.1186/s13634-024-01166-8