# Maximum likelihood inference for a class of discrete-time Markov switching time series models with multiple delays

## Abstract

Autoregressive Markov switching (ARMS) time series models are used to represent real-world signals whose dynamics may change over time. They have found application in many areas of the natural and social sciences, as well as in engineering. In general, inference in this kind of system involves two problems: (a) detecting the number of distinct dynamical models that the signal may adopt and (b) estimating any unknown parameters in these models. In this paper, we introduce a new class of nonlinear ARMS time series models with delays that includes, among others, many systems resulting from the discretisation of stochastic delay differential equations (DDEs). Remarkably, this class includes cases in which the discretisation time grid is not necessarily aligned with the delays of the DDE, resulting in discrete-time ARMS models with real (non-integer) delays. The incorporation of real, possibly long, delays is a key departure compared to typical ARMS models in the literature. We describe methods for the maximum likelihood detection of the number of dynamical modes and the estimation of unknown parameters (including the possibly non-integer delays) and illustrate their application with a nonlinear ARMS model of the El Niño–Southern Oscillation (ENSO) phenomenon.

## 1 Introduction

### 1.1 Background

Discrete-time autoregressive Markov switching (ARMS) time series models [15] are used to represent real-world signals whose dynamics may change over time. For example, if $$\{x_n\}_{n\ge 0}$$ is the signal of interest, we may model its evolution as

\begin{aligned} x_n = \psi [l_n](x_{0:n-1},u_n,\alpha ), \end{aligned}
(1)

where $$n \ge 0$$ is the current time, $$x_{0:n-1}=\{ x_0, x_1, \ldots , x_{n-1} \}$$ is the signal history, $$\{u_n\}_{n\ge 0}$$ is some noise (random) process, $$\alpha$$ is a vector of parameters and $$\{l_n\}_{n\ge 0}$$ is a Markov chain [40], i.e. a sequence of discrete random indices that change over time according to a Markov kernel that describes the conditional probabilities $$\text {Prob}\left( l_n=i | l_{n-1}=j \right)$$ for suitable integers i and j. For each distinct value $$l_n$$ we have a different map $$\psi [l_n](\cdot ,\cdot ,\cdot )$$ and, hence, the dynamics of $$x_n$$ change with the evolution of the Markov chain. Also, different functions $$\psi [i]$$ and $$\psi [j]$$ ($$i \ne j$$) may depend on different parameters in $$\alpha$$. ARMS models have found a plethora of applications in statistical signal processing for econometrics [8, 9, 13, 19, 43], engineering [16, 17, 28, 39], the geosciences [1, 25, 32], or complex networks [24], to name a few examples.

Inference for ARMS models involves

1. (a)

the detection of the number of possible values that the Markov chain $$\{l_n\}_{n\ge 0}$$ can take,

2. (b)

and the estimation of any unknown parameters contained in $$\alpha$$.

We may assume, without loss of generality, that $$l_n \in \{ 1, \ldots , L \}$$. Following [24], in this paper we refer to each map $$\psi [l]$$, $$1 \le l \le L$$, as a layer and, hence, task (a) consists in estimating the number of active dynamical layers L from a sequence of data samples $$x_{0:n}$$. This is a model order selection problem that can be tackled using the Akaike and Bayesian information criteria (AIC and BIC, respectively) [25, 26, 38, 41], penalised distances [27], penalised likelihoods [32, 34], the three-pattern method [38] or the Hannan-Quinn criterion (HQC) [38]. Linear ARMS models admit an equivalent representation as autoregressive moving-average (ARMA) systems and, in this case, the number of layers L can also be inferred from the covariance matrix of the ARMA process [9, 48].

As for problem (b), maximum likelihood (ML) and maximum a posteriori estimators can be approximated using different forms of the expectation-maximisation (EM) algorithm [1, 15, 32], while Markov chain Monte Carlo (MCMC) methods have been applied for Bayesian estimation [8, 24, 28, 43, 49]. Simpler moment matching techniques can also be applied for parameter estimation in linear ARMS systems [20].

See [15, 36] for a survey of recent ARMS models and methods.

### 1.2 Contributions

In this paper we investigate nonlinear ARMS models where the independent noise process $$\{u_n\}_{n\ge 0}$$ is additive and each dynamical layer depends on a different delay of the signal. Therefore, we particularise Eq. (1) by assuming that the $$l_n$$-th nonlinear map can be written as $$\psi [l_n](x_{0:n-1},u_n,\alpha ) = \phi [l_n](x_{n-1}, x_{n-D[l_n]}, \alpha ) + u_n[l_n],$$ where $$\{l_n\}_{n\ge 0}$$ is a homogeneous Markov chain, each $$D[l] > 1$$ ($$1 \le l \le L$$) is a (possibly long) delay, specific to the l-th dynamical layer, $$\phi [l_n]$$ is a nonlinear function and $$u_n[l_n]$$ is a (layer-specific) noise process. The time series model (1) hence becomes

\begin{aligned} x_n = \phi [l_n](x_{n-1},x_{n-D[l_n]},\alpha ) + u_n[l_n]. \end{aligned}
(2)

A detailed description of the relevant family of models is given in Sect. 2. Our formulation is devised to target time series models that result from the discretisation of delay differential equations (DDEs) [4, 7], which appear often in geophysics [22, 46].

The specific contributions of this work can be summarised as follows:

• We introduce an ARMS time series model that includes systems resulting from the discretisation of stochastic DDEs. In particular, the proposed model includes cases in which the times at which the signal can be observed are not necessarily aligned with the relevant delays (which are often unknown) resulting in discrete-time models with real (non-integer) delays.

• We propose a stability criterion for the new nonlinear ARMS models with multiple delays. In particular, we describe sufficient conditions on the dynamics of the constituent layers that ensure the existence of finite bounds for the moments of the ARMS model up to a given order.

• We provide an EM framework for parameter estimation, based on space alternation and a simple stochastic optimisation algorithm, that can be systematically implemented for the models of the proposed class. This scheme can be easily combined with an ML detector of the number L of dynamical layers.

• We illustrate the application of the proposed model and inference methodology by discretising and then fitting a stochastic DDE which has been proposed as a representation of the El Niño–Southern Oscillation (ENSO) [46]. We obtain numerical results for models with up to four dynamical layers and either integer or real delays. In this example, non-integer delays appear naturally when the observation times are not aligned with the physical delays. We validate the model and estimation algorithm using synthetically generated observations and then apply the methodology to real ENSO data. The prediction accuracy of the proposed nonlinear delayed ARMS model is compared with several deep learning-based time series forecasting models.

### 1.3 Notation

Hereafter, scalar magnitudes are denoted by regular-face letters, e.g. x. Column vectors and matrices are represented by bold-face letters, either lower case or upper case, respectively. For example, $$\varvec{x} = \left( x_1, \ldots , x_m \right) ^\top$$ is an $$m \times 1$$ vector (the superscript $$^\top$$ denotes transposition) and $$\varvec{X} = \left( \varvec{x}_1, \ldots , \varvec{x}_d \right)$$ is an $$m \times d$$ matrix, with $$\varvec{x}_i = \left( x_{1i}, \ldots , x_{mi} \right) ^\top$$ denoting its i-th column. Discrete time is indicated as a subscript, e.g. $$\varvec{x}_n$$. Dependences on an integer index other than time are represented with the index between square brackets, e.g. D[l] in (2) is the delay associated to the l-th dynamical layer.

We often abide by a simplified notation for probability functions, where p(x) denotes the probability density function (pdf) of the random variable (r.v.) x. This notation is argument-wise, hence if we have two r.v.’s x and y, then p(x) and p(y) denote the corresponding density functions, possibly different; p(x, y) denotes the joint pdf and p(x|y) is the conditional pdf of x given y. The notation for multidimensional r.v.’s, e.g. $$\varvec{x}$$ and $$\varvec{y}$$, is analogous, i.e. $$p(\varvec{x},\varvec{y})$$, $$p(\varvec{x}|\varvec{y})$$, etc. The probability mass function (pmf) of a discrete r.v. x is denoted P(x) (note the upper case) and we follow the same argument-wise convention as for pdf’s.

### 1.4 Contents

The rest of the paper is organised as follows. In Sect. 2, we introduce the new class of discrete-time, nonlinear ARMS models with multiple delays and present a stability criterion. An EM framework for inference is described in Sect. 3, and computer simulation examples are presented in Sect. 4. A case study with real ENSO sea surface temperature anomalies is presented in Sect. 5. Section 6 is devoted to the conclusions.

## 2 Time series models

### 2.1 Delayed nonlinear ARMS time series models

We introduce a nonlinear ARMS time series model with L layers, i.e. L different dynamical modes, each one induced by a different nonlinear map and a different integer delay. An extended model with non-integer delays is described in Sect. 2.2.

Let $$\{\varvec{x}_n\}_{n\ge 0}$$ be a random sequence taking values in $$\mathbb {R}^d$$, and let $$\{l_n\}_{n\ge 0}$$ be a homogeneous Markov chain, taking values in the finite set $$\mathcal {L}= \{1, \ldots , L\}$$, with $$L \times L$$ transition matrix denoted as $$\varvec{M}$$ and initial pmf $$P_0:\mathcal {L}\mapsto [0,1]$$. The entry in the i-th row and j-th column of $$\varvec{M}$$, denoted $$M_{ij}$$, is the probability mass $$P(l_n=j | l_{n-1}=i)$$. The delayed nonlinear ARMS model with L layers and initial value $$\varvec{x}_0$$, denoted DN-ARMS(L), is constructed as

\begin{aligned} \varvec{x}_n = \phi [l_n](\varvec{x}_{n-1}, \varvec{x}_{n - D[l_n]}, \boldsymbol{\alpha }) + \varvec{u}_n[l_n], \quad n \ge 1, \end{aligned}
(3)

where $$\boldsymbol{\alpha } = (\alpha _1, \ldots , \alpha _k )^\top$$ is a $$k \times 1$$ vector of real model parameters, $$\Lambda _D = \{ D[1], \ldots , D[L] \}$$ is a set of positive integer delays, one per dynamical layer, the functions $$\phi [1], \ldots , \phi [L]$$ are distinct $$\mathbb {R}^d \times \mathbb {R}^d \times \mathbb {R}^k \mapsto \mathbb {R}^d$$ nonlinear maps, and $$\{\varvec{u}_n[1]\}_{n\ge 1}, \ldots , \{\varvec{u}_n[L]\}_{n \ge 1}$$ are independent sequences of $$d\times 1$$ i.i.d. random vectors with layer-dependent distinct pdf’s $$p_l(\varvec{u})$$, $$l=1, \ldots , L$$. The model description is complete with a prior pdf $$p(\varvec{x}_{-D^+:-1})$$, where $$D^+ = \max _{l \in \mathcal {L}} D[l]$$ is the maximum delay and $$\varvec{x}_{i:j}$$ denotes the subsequence $$\varvec{x}_i, \varvec{x}_{i+1}, \ldots , \varvec{x}_j$$. Note that a distribution for the r.v. $$\varvec{x}_0$$ is not sufficient to initialise the model because of the delays D[l] (it is often necessary to assume some signal values for $$n<0$$).
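For illustration, the generative recursion (3) can be simulated in a few lines of Python. The sketch below is our own toy instance and not part of the model specification: the two layer maps, delays, noise scales and transition matrix are hypothetical choices, and layers are indexed from 0 rather than 1 for programming convenience.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer scalar example (d = 1); all constants below are
# illustrative choices, not values prescribed by the model class.
phi = [
    lambda x1, xd, a: a[0] * x1 + a[1] * np.tanh(xd),   # layer 0
    lambda x1, xd, a: a[2] * x1 - a[3] * xd,            # layer 1
]
D = [3, 7]                     # one positive integer delay per layer
alpha = np.array([0.5, 0.3, 0.4, 0.2])
sigma = [0.1, 0.2]             # layer-dependent noise scales
M = np.array([[0.95, 0.05],    # M[i, j] = P(l_n = j | l_{n-1} = i)
              [0.10, 0.90]])

def simulate_dn_arms(T, x_init, l0=0):
    """Draw x_1..x_T from (3), given max(D) initial signal values in x_init."""
    x = list(x_init)           # holds x_{n - D^+}, ..., x_{n-1}
    l, layers = l0, []
    for _ in range(T):
        l = rng.choice(2, p=M[l])                         # Markov switch
        xn = phi[l](x[-1], x[-D[l]], alpha) + sigma[l] * rng.normal()
        x.append(xn)
        layers.append(l)
    return np.array(x[len(x_init):]), np.array(layers)

x, layers = simulate_dn_arms(500, np.zeros(max(D)))
```

Note that the initial buffer `x_init` plays the role of the prior signal values $$\varvec{x}_{-D^+:-1}$$ discussed above.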

### 2.2 Continuous-delay Markov-switch nonlinear models

Model (3) can be extended to incorporate real positive delays. Non-integer delays arise, for example, from the discretisation of stochastic DDEs when the continuous-time delays are not aligned with the time grid of the observed series. Such scenarios are quite natural. For example, in Sect. 4.1 we look into models of ENSO temperature anomalies. These temperatures are typically collected on a monthly basis; however, there is no physical reason for the delays in the DDE models to be an integer number of months.

Assume that $$D[l] \in [1,+\infty )$$ for all $$l \in \mathcal {L}$$. Model (3) can be extended to account for possibly non-integer delays if we rewrite it as

\begin{aligned} \varvec{x}_n = \phi [l_n](\varvec{x}_{n-1}, \tilde{\varvec{x}}_{n - D[l_n]}, \boldsymbol{\alpha }) + \varvec{u}_n[l_n], \quad n \ge 0, \end{aligned}
(4)

where $$\tilde{\varvec{x}}_{n-D[l]}$$ is constructed as an interpolation of consecutive elements of the series $$\{\varvec{x}_n\}_{n\ge 0}$$. In general, for $$\tau \in \mathbb {R}^+$$, we let $$\tilde{\varvec{x}}_{\tau } = \sum _{m=0}^\infty \kappa (\tau -m) \varvec{x}_m,$$ where $$\kappa : \mathbb {R}\mapsto \mathbb {R}$$ is an interpolation kernel satisfying that $$\tilde{\varvec{x}}_\tau = \varvec{x}_\tau$$ when $$\tau$$ is an integer. In the computer experiments of Sects. 4 and 5 we restrict our attention, for simplicity, to the order 1 interpolation

\begin{aligned} \tilde{\varvec{x}}_\tau = (\lfloor \tau \rfloor + 1 - \tau ) \varvec{x}_{\lfloor \tau \rfloor } + (\tau - \lfloor \tau \rfloor ) \varvec{x}_{\lfloor \tau \rfloor + 1}, \end{aligned}
(5)

where, for $$\tau \in \mathbb {R}$$, $$\lfloor \tau \rfloor = \sup \{n \in \mathbb {Z}: n<\tau \}$$.

Let us remark that $$\tilde{\varvec{x}}_{n-D[l_n]}$$ in Eq. (4) is not an observed data point; however, it can be deterministically computed from observed data. Also, model (4) reduces to model (3) when $$D[1], \ldots , D[L]$$ are all integers. We refer to model (4), with real delays, as cDN-ARMS(L).
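The order-1 rule (5) is straightforward to implement. The sketch below is our own illustration in plain NumPy; the integer case is handled separately, so the result does not depend on which floor convention is adopted at integer $$\tau$$.

```python
import numpy as np

def x_tilde(x, tau):
    """Order-1 interpolation of Eq. (5): the series x evaluated at real index tau.

    x is a 1-D array with x[m] = x_m for m = 0, 1, ...  At integer tau the
    rule returns x[tau] exactly, as required of the interpolation kernel.
    """
    k = int(np.floor(tau))
    if k == tau:                 # integer index: no interpolation needed
        return x[k]
    w = tau - k                  # fractional part, in (0, 1)
    return (1.0 - w) * x[k] + w * x[k + 1]

x = np.array([0.0, 2.0, 4.0, 8.0])
assert x_tilde(x, 2) == 4.0      # integer delay recovers x_2 exactly
assert x_tilde(x, 2.25) == 5.0   # 0.75 * x_2 + 0.25 * x_3
```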

### 2.3 Stability analysis

#### 2.3.1 Ergodicity of nonlinear ARMS models

While several authors have analysed the properties of specific ARMS-type models [2, 5, 15, 44], general results for discrete-time nonlinear ARMS processes are scarce. Perhaps the most relevant reference is [47], where Yao and Attali provide sufficient conditions for nonlinear, first-order ARMS models to have an invariant distribution with finite moments. Their analysis deals with systems with no delays or, equivalently in our notation, with the case where $$D[1]=\cdots =D[L]=1$$. As a consequence, the models they analyse can be written as

\begin{aligned} \varvec{x}_n = \phi [l_n](\varvec{x}_{n-1},\boldsymbol{\alpha }) + \varvec{u}_n[l_n], \quad n \ge 0, \end{aligned}

with known parameter vector $$\boldsymbol{\alpha }$$.

A key assumption in the analysis of [47] is that the functions $$\phi [l]$$, $$l=1, \ldots , L$$, need to be either sublinear or Lipschitz with sufficiently small constants. In particular, if the functions are Lipschitz, i.e. they satisfy $$\Vert \phi [l](\varvec{x},\boldsymbol{\alpha }) - \phi [l](\varvec{x}',\boldsymbol{\alpha })\Vert \le c[l] \Vert \varvec{x} - \varvec{x}' \Vert$$ for some constants $$c[l]<\infty$$, $$l=1, \ldots , L$$, then the convergence theorems in [47] rely on the assumption

\begin{aligned} \sum _{l=1}^L P_\infty (l) \log \left( c[l]\right) < 0, \end{aligned}
(6)

where $$P_\infty$$ is the limit distribution of the homogeneous Markov chain $$\{l_n\}_{n\ge 0}$$ with transition matrix $$\varvec{M}$$. Unfortunately, the inequality (6) does not hold when the functions $$\phi [l]$$ result from the discretisation of differential equations, which is precisely the case targeted by the cDN-ARMS(L) models in this paper.

For example, assume that the l-th layer of an ARMS model is derived from a simple stochastic differential equation [35] of the form

\begin{aligned} \textsf{d}\varvec{x} = f[l](\varvec{x},\boldsymbol{\alpha })\textsf{d}t + \sigma [l] \textsf{d}\varvec{w}, \end{aligned}
(7)

where t denotes continuous time and $$\varvec{w}(t)$$ is a standard multivariate Wiener process. The Euler-Maruyama time discretisation of (7) with time step $$h>0$$ yields

\begin{aligned} \varvec{x}_n = \varvec{x}_{n-1} + h f[l](\varvec{x}_{n-1},\boldsymbol{\alpha }) + \sigma [l]\sqrt{h}\varvec{z}_n, \end{aligned}
(8)

where $$\varvec{z}_n$$ is a sequence of i.i.d. standard Gaussian random vectors. A simple inspection of (8) shows that the corresponding l-th function in the ARMS model becomes

\begin{aligned} \phi [l](\varvec{x},\boldsymbol{\alpha }) = \varvec{x} + hf[l](\varvec{x},\boldsymbol{\alpha }). \end{aligned}

If f[l] is Lipschitz with constant A[l], then $$\phi [l]$$ is Lipschitz with constant $$c[l] = 1 + hA[l]>1$$. Since this is true for $$l=1, \ldots , L$$, then $$\sum _{l=1}^L P_\infty (l)\log \left( c[l]\right) >0$$, which violates the assumptions in the analysis of [47]. The same issue arises in the case of sublinear functions.
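The argument above is easy to make concrete. The following sketch discretises a scalar SDE by Euler-Maruyama; the drift $$f(x) = -ax + b\sin (x)$$ and the constants are hypothetical choices of ours, used only to display the Lipschitz bound $$c = 1 + hA > 1$$ carried by the discrete map.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scalar drift f(x) = -a*x + b*sin(x), Lipschitz with A <= a + b.
a, b, sigma_l, h = 1.0, 0.5, 0.3, 0.01

def f(x):
    return -a * x + b * np.sin(x)

def euler_maruyama(x0, n_steps):
    """Euler-Maruyama discretisation of dx = f(x) dt + sigma dw, as in (8)."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        x[n + 1] = x[n] + h * f(x[n]) + sigma_l * np.sqrt(h) * rng.normal()
    return x

path = euler_maruyama(1.0, 1000)

# The discrete map phi(x) = x + h * f(x) only carries the Lipschitz bound
# c = 1 + h * A, strictly greater than 1 for any h > 0, so condition (6)
# of [47] cannot be verified through this bound.
c_bound = 1.0 + h * (a + b)
```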

#### 2.3.2 Stability of cDN-ARMS(L) models

To avoid the difficulties with the method of [47], we seek a different characterisation of the stability of cDN-ARMS(L) processes. In particular, we aim at finding conditions that guarantee that the random sequence $$\{\varvec{x}_n\}_{n\ge 0}$$ generated by a cDN-ARMS(L) model has finite moments of order $$q \ge 1$$ for all n. To be specific, let $$\Vert \varvec{v} \Vert$$ denote the Euclidean norm of a vector $$\varvec{v}$$ and let $$\mathbb {E}_{\varvec{x}}[ \cdot ]$$ denote expectation w.r.t. the random vector $$\varvec{x}$$ in the subscript. We seek sufficient conditions to ensure $$\sup _n \mathbb {E}_{\varvec{x}_n}[ \Vert \varvec{x}_n \Vert ^q ]<\infty$$ for some $$q \ge 1$$.

Let $$\{\pi _n\}_{n \ge 0}$$ denote a specific family of conditional pdf’s and let $$\varvec{y}_0 \sim \pi _0(\varvec{y}_0)$$ and $$\varvec{y}_n \sim \pi _n(\varvec{y}_n|\varvec{y}_{0:n-1})$$ be the vector random sequence generated by $$\{\pi _n\}_{n \ge 0}$$. If we choose an integer $$n_0 \ge 0$$ and an arbitrary collection of random vectors $$\bar{\varvec{y}}_{0:n_0}$$, then we can construct a new sequence $$\{ \bar{\varvec{y}}_n \}_{n \ge 0}$$ where, for $$n > n_0$$, the random vectors $$\bar{\varvec{y}}_n$$ are generated by the same conditional pdf’s $$\{\pi _{n}\}_{n > n_0}$$, i.e. $$\bar{\varvec{y}}_n \sim \pi _n(\bar{\varvec{y}}_n|\bar{\varvec{y}}_{0:n-1})$$. Let us refer to the new sequence $$\{\bar{\varvec{y}}_n\}_{n \ge 0}$$ as a patched version of the original $$\{\varvec{y}_n\}_{n \ge 0}$$, where the patch is the initial subsequence $$\bar{\varvec{y}}_{0:n_0}$$. We now introduce a notion of stability that is slightly stronger than just requiring bounded moments.

### Definition 1

Let $$\{\varvec{y}_n\}_{n \ge 0}$$ be a sequence of random vectors such that

\begin{aligned} \sup _{n \ge 0} \mathbb {E}_{\varvec{y}_n}\left[ \Vert \varvec{y}_n \Vert ^q \right] \le c_\infty ^q \end{aligned}

for some $$q \ge 1$$ and some constant $$c_\infty ^q < \infty$$. The sequence $$\{\varvec{y}_n\}_{n \ge 0}$$ is q-stable if, and only if, every patched version $$\{ \bar{\varvec{y}}_n \}_{n \ge 0}$$ satisfies

\begin{aligned} \sup _{n \ge 0} \mathbb {E}_{\bar{\varvec{y}}_n}\left[ \Vert \bar{\varvec{y}}_n \Vert ^q \right] \le c_\infty ^q \vee \sup _{n \le n_0} \mathbb {E}_{\bar{\varvec{y}}_n}\left[ \Vert \bar{\varvec{y}}_n \Vert ^q \right] . \end{aligned}

Notation $$a \vee b$$ denotes the maximum between a and b. Intuitively, a random sequence is q-stable when

1. (a)

it has finite moments of order q, and

2. (b)

these moments remain bounded when we force an arbitrary initialisation, as long as these initial vectors (the patch) have bounded moments as well.

The notion of q-stability applies in a straightforward way to the class of cDN-ARMS(L) models given by Eq. (4) and a Markov chain $$\{l_n\}_{n\ge 0}$$. To see this, let us denote by

\begin{aligned} \varvec{x}_n^{(l)} = \phi [l] (\varvec{x}_{n-1}, \tilde{\varvec{x}}_{n-D[l]} ,\boldsymbol{\alpha }) + \varvec{u}_n[l], \quad l = 1, \ldots , L, \end{aligned}

the random sequences independently generated by each one of the L layers alone. We can state the following stability result.

### Proposition 1

If $$\mathbb {E}_{\varvec{x}_0}\left[ \Vert \varvec{x}_0 \Vert ^q \right] <\infty$$ and the random sequences $$\{ \varvec{x}_n^{(l)} \}_{n\ge 0}$$, $$l=1, \ldots , L$$, are q-stable, then the corresponding cDN-ARMS(L) model with Markov chain $$\{l_n\}_{n\ge 0}$$ taking values in $$\{1, \ldots , L\}$$ is q-stable as well.

A detailed proof is presented in Appendix A. Note that Proposition 1 does not require the Markov chain $$\{l_n\}_{n\ge 0}$$ to be homogeneous. Also, the condition that all sequences $$\{ \varvec{x}_n^{(l)} \}_{n\ge 0}$$ be q-stable is sufficient, but we conjecture that it is not necessary. It seems clear that the model may include layers which are not q-stable; yet, if these layers are visited with low probability and/or the Markov chain $$\{l_n\}_{n\ge 0}$$ dwells on them for very short periods of time, the overall sequence $$\{\varvec{x}_n\}_{n\ge 0}$$ may still be q-stable. An extended analysis is left for future work.
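Proposition 1 can also be probed numerically. The sketch below is a toy two-layer scalar model of our own construction, with both layers contractive (hence q-stable on their own); it estimates $$\mathbb {E}[|x_n|^q]$$ by Monte Carlo and exhibits no moment blow-up over the simulated horizon, consistent with the proposition.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-layer scalar model: the coefficient magnitudes of each layer sum
# below 1, so each layer alone is contractive (illustrative choices only).
phi = [lambda x1, xd: 0.6 * x1 + 0.2 * xd,
       lambda x1, xd: 0.3 * x1 + 0.4 * xd]
D = [2, 5]
M = np.array([[0.9, 0.1], [0.2, 0.8]])
q = 4

def max_empirical_moment(T=1000, runs=100):
    """Monte Carlo estimate of sup_n E[|x_n|^q] over a finite horizon."""
    acc = np.zeros(T)
    for _ in range(runs):
        x = [0.0] * max(D)      # zero-valued patch x_{-D^+ + 1}, ..., x_0
        l = 0
        for n in range(T):
            l = rng.choice(2, p=M[l])
            x.append(phi[l](x[-1], x[-D[l]]) + 0.1 * rng.normal())
            acc[n] += abs(x[-1]) ** q
    return (acc / runs).max()

m = max_empirical_moment()      # stays bounded; no blow-up over the horizon
```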

## 3 Model inference

We introduce a space-alternating (SA) EM algorithm for iterative ML parameter estimation in the general cDN-ARMS(L) model described in Sect. 2.2. First, we obtain the likelihood for the proposed class of models, recall the standard EM method and explain why it is not tractable. We then describe the SA-EM scheme and conclude this section with a succinct discussion on the ML detection of the number L of dynamical layers.

### 3.1 Likelihood function

Let $$\varvec{x}_{0:T}=\{ \varvec{x}_0, \ldots , \varvec{x}_T \}$$ be a sequence of observations (and assume that $$\varvec{x}_n$$ is given for all $$n<0$$). The set of model parameters to be estimated is $$\Lambda = \Lambda _M \cup \Lambda _D \cup \Lambda _\alpha$$, where

\begin{aligned} \Lambda _M = \{ M_{1,1}, \ldots , M_{L,L} \},~~ \Lambda _D = \{ D[1], \ldots , D[L] \}, \text {~and~} \Lambda _\alpha = \{ \alpha _1, \ldots , \alpha _k \} \end{aligned}

are the set of entries of the $$L\times L$$ transition matrix $$\varvec{M}$$, the set of (possibly real) delays and the set of entries of the $$k\times 1$$ vector $$\boldsymbol{\alpha }$$, respectively.

We denote the likelihood of the parameter set $$\Lambda$$ given the observed sequence $$\varvec{x}_{0:T}$$ as $$p(\varvec{x}_{0:T} | \Lambda )$$. In order to obtain an explicit expression for the likelihood we write it in terms of the joint distribution of $$\varvec{x}_{0:T}$$ and the Markov sequence of layers $$l_{0:T} = \{ l_0, \ldots , l_T \}$$, namely

\begin{aligned} p(\varvec{x}_{{0:T}}|\Lambda ) = \sum _{l_{0:T} \in \mathcal {L}^{T+1}} p(\varvec{x}_{0:T},l_{0:T}|\Lambda ). \end{aligned}
(9)

For any $$0 < n \le T$$, we can use the chain rule to obtain a recursive decomposition of the joint distribution,

\begin{aligned} p(\varvec{x}_{0:n}, l_{0:n}|\Lambda ) = p(\varvec{x}_n| \varvec{x}_{0:n-1},l_n,\Lambda ) M_{l_{n-1},l_n} p(\varvec{x}_{0:n-1}, l_{0:n-1}|\Lambda ) \end{aligned}
(10)

where we have used the Markov property and the fact that $$l_n$$ is conditionally independent of $$\varvec{x}_{0:n-1}$$ given $$l_{n-1}$$ to show that $$p(l_n|l_{0:n-1},\varvec{x}_{0:n-1},\Lambda ) = M_{l_{n-1},l_n}$$ and $$p(\varvec{x}_n| \varvec{x}_{0:n-1},l_{0:n},\Lambda ) = p(\varvec{x}_n| \varvec{x}_{0:n-1},l_n,\Lambda )$$. If we (repeatedly) substitute the recursive relationship (10) in Eq. (9), we readily obtain

\begin{aligned} p(\varvec{x}_{{0:T}}|\Lambda ) = \sum _{l_{0:T} \in \mathcal {L}^{T+1}} \prod _{n=1}^T p(\varvec{x}_n| \varvec{x}_{0:n-1},l_n,\Lambda ) M_{l_{n-1},l_n} p(\varvec{x}_0|l_0,\Lambda ) P_0(l_0), \end{aligned}
(11)

where $$P_0(l_0)$$ is the initial pmf of the Markov chain. All factors in the expression above can be computed. In particular, note that $$p(\varvec{x}_0|l_0,\Lambda )$$ is tractable because we have assumed that $$\varvec{x}_{-1}, \varvec{x}_{-2}, \ldots$$ are known.
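Although the sum in (11) ranges over $$L^{T+1}$$ layer sequences, it can be evaluated at $$\mathcal {O}(TL^2)$$ cost by a forward recursion over the quantities $$a_n(l) = p(\varvec{x}_{0:n}, l_n = l)$$. The sketch below is our own minimal illustration with scalar Gaussian noise; the function `cond_mean` is a hypothetical stub standing in for the layer map evaluated on the signal history.

```python
import numpy as np

def gauss(v, m, s):
    """Univariate Gaussian pdf evaluated at v."""
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def switching_loglik(x, cond_mean, M, P0, sigma):
    """Evaluate the likelihood (11) by a forward recursion.

    Propagates a_n(l) = p(x_{0:n}, l_n = l) instead of summing over all
    layer sequences.  cond_mean(x, n, l) returns the conditional mean of
    x_n under layer l (a user-supplied stub for phi[l]).
    """
    L = M.shape[0]
    T = len(x) - 1
    a = np.array([gauss(x[0], cond_mean(x, 0, l), sigma[l]) * P0[l]
                  for l in range(L)])                       # a_0(l)
    for n in range(1, T + 1):
        lik = np.array([gauss(x[n], cond_mean(x, n, l), sigma[l])
                        for l in range(L)])
        a = lik * (M.T @ a)    # a_n(j) = p(x_n|., j) * sum_i M_ij a_{n-1}(i)
    return np.log(a.sum())     # log p(x_{0:T})

# toy usage with two illustrative AR(1)-style layers (delay 1 for brevity):
M = np.array([[0.7, 0.3], [0.4, 0.6]])
P0 = np.array([0.5, 0.5])
sigma = [0.3, 0.5]

def cond_mean(x, n, l):
    prev = x[n - 1] if n > 0 else 0.0
    return 0.5 * prev if l == 0 else -0.5 * prev

ll = switching_loglik(np.array([0.1, -0.2, 0.05, 0.3]), cond_mean, M, P0, sigma)
```

A practical implementation would propagate log-weights (or normalise the $$a_n$$'s) to avoid numerical underflow for long series.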

An ML estimator of the parameters is a solution of the problem $$\Lambda _{ML} \in \arg \max _{\Lambda } p(\varvec{x}_{0:T}|\Lambda )$$. However, even when the $$\phi [l]$$’s are linear, it is not possible to compute $$\Lambda _{ML}$$ exactly and we need to resort to numerical approximations [15].

### 3.2 Expectation-maximisation algorithm

Let x and y be r.v.’s, let $$\theta$$ be some parameter and let f(x) be some transformation of x. We write $$\mathbb {E}_{x|y,\theta }[ f(x) ]$$ to denote the expected value of f(x) w.r.t. the distribution with pdf $$p(x|y,\theta )$$, i.e. $$\mathbb {E}_{x|y,\theta }[ f(x) ] = \int f(x) p(x|y,\theta ) \textsf{d}x.$$

A standard EM algorithm [11, 31] for the iterative ML estimation of $$\Lambda$$ from the data sequence $$\varvec{x}_{0:T}$$ can be outlined as follows.

1. 1.

Initialisation: choose an initial (arbitrary) estimate $${\hat{\Lambda }}_0$$.

2. 2.

Expectation step: given an estimate $${\hat{\Lambda }}_i$$, obtain the expectation

\begin{aligned} \mathcal {E}_i(\Lambda ) = \mathbb {E}_{l_{0:T}|\varvec{x}_{0:T}, {\hat{\Lambda }}_i}\left[ \log \left( p(\varvec{x}_{0:T},l_{0:T}|\Lambda ) \right) \right] . \end{aligned}
(12)
3. 3.

Maximisation step: obtain a new estimate,

\begin{aligned} {\hat{\Lambda }}_{i+1} \in \arg \max _{\Lambda } \mathcal {E}_i(\Lambda ). \end{aligned}
(13)

Using standard terminology, $$\varvec{x}_{0:T}$$ is the observed data, $$l_{0:T}$$ is the latent data and $$\{ \varvec{x}_{0:T},l_{0:T} \}$$ is the complete dataset. The basic guarantee provided by the EM algorithm is that the estimates $${\hat{\Lambda }}_0, {\hat{\Lambda }}_1, \ldots$$ have non-decreasing likelihoods, i.e. for every $$i \ge 0$$, $$p(\varvec{x}_{0:T}|{\hat{\Lambda }}_{i+1}) \ge p(\varvec{x}_{0:T}|{\hat{\Lambda }}_i)$$ [31].

If we substitute the decomposition (11) into the expectation of (12) we arrive at

\begin{aligned} \mathcal {E}_i(\Lambda ) = \mathcal {E}_i^0(\Lambda _D \cup \Lambda _\alpha ) + \mathcal {E}_i^1(\Lambda _M) + \mathbb {E}_{l_0|\varvec{x}_{0:T},{\hat{\Lambda }}_i}\left[ \log (P_0(l_0)) \right] , \end{aligned}
(14)

where

\begin{aligned} \mathcal {E}_i^0(\Lambda _D \cup \Lambda _\alpha ) = \sum _{n=0}^{T} \mathbb {E}_{l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i}\left[ \log (p(\varvec{x}_n| \varvec{x}_{0:n-1}, l_n, \Lambda _D \cup \Lambda _\alpha )) \right] \end{aligned}

and

\begin{aligned} \mathcal {E}_i^1(\Lambda _M) = \sum _{n = 1}^T \mathbb {E}_{l_{n-1:n}|\varvec{x}_{0:T},{\hat{\Lambda }}_i}\left[ \log (M_{l_{n-1},l_n}) \right] . \end{aligned}

Now, if we write the posterior expectations $$\mathbb {E}_{l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i}[\cdot ]$$ and $$\mathbb {E}_{l_{n-1:n}|\varvec{x}_{0:T},{\hat{\Lambda }}_i}[\cdot ]$$ explicitly we obtain

\begin{aligned} \mathcal {E}_i^0(\Lambda _D \cup \Lambda _\alpha )= & {} \sum _{n=0}^{T} \sum _{l_n \in \mathcal {L}} \log (p(\varvec{x}_n| \varvec{x}_{0:n-1}, l_n, \Lambda _D \cup \Lambda _\alpha )) P(l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i), \end{aligned}
(15)
\begin{aligned} \mathcal {E}_i^1(\Lambda _M)= & {} \sum _{n = 1}^T \sum _{l_{n-1} \in \mathcal {L}} \sum _{l_n\in \mathcal {L}} \log (M_{l_{n-1},l_n}) P(l_{n-1},l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i), \end{aligned}
(16)

where $$P(l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i)$$ and $$P(l_{n-1},l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i)$$ are the posterior pmf’s of the random indices $$l_n$$ and $$l_{n-1:n}$$, respectively, conditional on the observations $$\varvec{x}_{0:T}$$ and the i-th parameter estimators $${\hat{\Lambda }}_i$$.

The posterior pmf’s $$P(l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i)$$ and $$P(l_{n-1},l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i)$$ can be computed exactly, for every $$n=0, \ldots , T$$, by running a forward-backward algorithm [15, 34]. Given these probabilities, from the sequence of Eqs. (13), (14), (15) and (16) it is apparent that we can update

\begin{aligned} {\hat{\Lambda }}_{M,i+1} = \arg \max _{\Lambda _M} \mathcal {E}_i^1(\Lambda _M). \end{aligned}
(17)
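The maximiser in (17) has the familiar closed form of the Baum-Welch transition update: normalised expected transition counts. A minimal sketch, assuming the pairwise posteriors from the forward-backward pass are stored in an array `xi` (our own naming convention):

```python
import numpy as np

def update_transition_matrix(xi):
    """Maximise E_i^1 in (17) under row-stochastic constraints.

    xi[n, i, j] holds P(l_n = i, l_{n+1} = j | x_{0:T}, Lambda_hat_i)
    for n = 0..T-1, as produced by a forward-backward pass; the solution
    is M_ij = sum_n xi[n, i, j] / sum_{n, j'} xi[n, i, j'].
    """
    counts = xi.sum(axis=0)                        # expected transition counts
    return counts / counts.sum(axis=1, keepdims=True)

# toy check: random pairwise posteriors, each time slice summing to 1
rng = np.random.default_rng(5)
xi = rng.random((10, 3, 3))
xi /= xi.sum(axis=(1, 2), keepdims=True)
M_hat = update_transition_matrix(xi)               # rows of M_hat sum to 1
```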

Unfortunately, the update of the delays (in the set $${{\hat{\Lambda }}}_{D,i}$$) and the model parameters $$\boldsymbol{\alpha }$$ (in the set $${{\hat{\Lambda }}}_{\alpha ,i}$$) cannot be carried out analytically, i.e. the problem

\begin{aligned} {\hat{\Lambda }}_{D,i+1} \cup {\hat{\Lambda }}_{\alpha ,i+1} \in \arg \max _{\Lambda _D \cup \Lambda _\alpha } \mathcal {E}_i^0(\Lambda _D \cup \Lambda _\alpha ) \end{aligned}

is intractable in general. Moreover, the number of parameters in $$\Lambda _D \cup \Lambda _\alpha$$ is typically large and they may be subject to constraints (e.g. positive parameters, or parameters contained in a certain interval), which makes numerical optimisation hard in practice. As a workaround, we propose a SA-EM algorithm [14] that can be used systematically for the class of models described by (4).

### 3.3 Space-alternating expectation-maximisation algorithm

The intuitive idea is to split the parameters into three sets: one containing the entries of matrix $$\varvec{M}$$ (since we can solve (17) exactly), another one containing the parameters that admit an exact update and a third set containing the parameters for which the update has to be done approximately, using numerical optimisation algorithms. Then, it is possible to cycle through these three sets, updating one or more parameters at a time, while all others are kept fixed. This approach is often termed ‘space-alternating’. It makes algorithm design simpler and still guarantees that the likelihood of the sequence of generated estimates is non-decreasing (as in the standard EM method).

To be specific, let us re-partition the parameter set as $$\Lambda = \Lambda _M \cup \Lambda _* \cup \Lambda _c$$, where $$\Lambda _M$$, $$\Lambda _*$$ and $$\Lambda _c$$ are disjoint sets, and

• $$\Lambda _M$$ contains the $$L \times L$$ entries of the transition matrix $$\varvec{M}$$, as before;

• $$\Lambda _* \subseteq \Lambda _D \cup \Lambda _\alpha$$ contains the parameters that can be updated exactly (when all others are kept fixed);

• and $$\Lambda _c = \left( \Lambda _D \cup \Lambda _\alpha \right) \backslash \Lambda _*$$ contains the remaining parameters, which have to be updated approximately, using numerical optimisation.

With this partition, we ensure the ability to solve the problem

\begin{aligned} {\hat{\Lambda }}_{*,i+1} = \arg \max _{\Lambda _*} \mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i}) \end{aligned}

exactly (note that $$\Lambda _* \cup \Lambda _c = \Lambda _D \cup \Lambda _\alpha$$, hence $$\mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i})$$ is well defined).

We can now summarise the proposed SA-EM algorithm as follows.

1. 1.

Initialisation: choose an initial (arbitrary) estimate $${\hat{\Lambda }}_0 = {\hat{\Lambda }}_{M,0} \cup {\hat{\Lambda }}_{*,0} \cup {\hat{\Lambda }}_{c,0} = {\hat{\Lambda }}_{M,0} \cup {\hat{\Lambda }}_{D,0} \cup {\hat{\Lambda }}_{\alpha ,0}$$.

2. 2.

Expectation step: given an estimate $${\hat{\Lambda }}_i$$, run a forward-backward algorithm to compute the pmf’s $$P(l_n|\varvec{x}_{0:T},{\hat{\Lambda }}_i)$$ for all $$l_n \in \mathcal {L}$$ and $$P(l_{n-1:n}|\varvec{x}_{0:T},{\hat{\Lambda }}_i)$$ for all $$l_{n-1:n} \in \mathcal {L}^2$$. Then construct the expectations $$\mathcal {E}_i^0(\Lambda _*\cup \Lambda _c)$$ and $$\mathcal {E}_i^1(\Lambda _M)$$ in Eqs. (15) and (16), respectively.

3. 3.

Maximisation step: compute

\begin{aligned} {\hat{\Lambda }}_{M,i+1} = \arg \max _{\Lambda _M} \mathcal {E}_i^1(\Lambda _M) \end{aligned}

and denote $$\Lambda _c = \{ \lambda ^c_1, \ldots , \lambda ^c_q \}$$ and $${\hat{\Lambda }}_{c,i} = \{ \lambda ^c_{1,i}, \ldots , \lambda ^c_{q,i} \}$$, where $$q \le L+k$$ is the number of parameters in $$\Lambda _c$$. For $$j = 1, \ldots , q$$, compute

\begin{aligned} {\hat{\lambda }}^c_{j,i+1} \text {~~such that~~} \max _{\Lambda _*} \mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i+1}^{j+}) \ge \max _{\Lambda _*} \mathcal {E}_i^0(\Lambda _* \cup {\hat{\Lambda }}_{c,i+1}^{j-}) \end{aligned}
(18)

where

\begin{aligned} {\hat{\Lambda }}_{c,i+1}^{j+}:= & {} \left\{ {{\hat{\lambda }}}^c_{1,i+1}, \ldots , {{\hat{\lambda }}}^c_{j-1,i+1}, {{\hat{\lambda }}}^c_{j,i+1}, {{\hat{\lambda }}}^c_{j+1,i}, \ldots , {{\hat{\lambda }}}^c_{q,i} \right\} , \text {~~and~~}\\ {\hat{\Lambda }}_{c,i+1}^{j-}:= & {} \left\{ {{\hat{\lambda }}}^c_{1,i+1}, \ldots , {{\hat{\lambda }}}^c_{j-1,i+1}, {{\hat{\lambda }}}^c_{j,i}, {{\hat{\lambda }}}^c_{j+1,i}, \ldots , {{\hat{\lambda }}}^c_{q,i} \right\} , \end{aligned}

to obtain $${\hat{\Lambda }}_{c,i+1} = \left\{ {{\hat{\lambda }}}^c_{1,i+1}, \ldots , {{\hat{\lambda }}}^c_{q,i+1} \right\}$$. Finally, let

\begin{aligned} {\hat{\Lambda }}_{*,i+1} = \arg \max _{\Lambda _*} \mathcal {E}_i^0\left( \Lambda _* \cup {\hat{\Lambda }}_{c,i+1} \right) \end{aligned}

and $${{\hat{\Lambda }}}_{i+1} = {{\hat{\Lambda }}}_{M,i+1}\cup {{\hat{\Lambda }}}_{*,i+1}\cup {{\hat{\Lambda }}}_{c,i+1}$$.

Intuitively, at the $$(i+1)$$-th iteration of the algorithm we update the parameters in $$\Lambda _M$$ exactly, then we update the parameters $$\lambda _1^c, \ldots , \lambda ^c_q \in \Lambda _c$$ one at a time in (18), and finally we update the parameters in $$\Lambda _*$$ exactly. The 1-dimensional updates in (18) can be numerically carried out in various ways. We suggest the application of the accelerated random search (ARS) method of [3] (see also [29] and Appendix B), which is straightforward to apply and has performed well in our computer experiments (presented in Sect. 4).
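As a concrete illustration of the 1-dimensional updates in (18), the sketch below implements a 1-D accelerated random search in the spirit of [3]. The function names and tuning constants are hypothetical (not the authors' implementation); the objective `f` stands for the map $$\lambda \mapsto \mathcal {E}_i^0(\cdot )$$ with all other parameters held fixed.

```python
import random

def ars_maximise_1d(f, lo, hi, iters=300, contract=2.0, r_min=1e-3, seed=0):
    """Maximise f on [lo, hi] by 1-D accelerated random search (sketch).

    On improvement the search radius is reset to the full interval width;
    otherwise it is contracted, and restarted once it falls below r_min.
    """
    rng = random.Random(seed)
    best_x = rng.uniform(lo, hi)
    best_f = f(best_x)
    r = hi - lo                      # current search radius
    for _ in range(iters):
        # sample a candidate uniformly in the radius-r ball around best_x
        x = best_x + rng.uniform(-r, r)
        x = min(max(x, lo), hi)      # clip to the feasible interval
        fx = f(x)
        if fx > best_f:              # improvement: accept and reset radius
            best_x, best_f = x, fx
            r = hi - lo
        else:                        # no improvement: contract the radius
            r /= contract
            if r < r_min:
                r = hi - lo
    return best_x, best_f
```

The same routine is applied to each parameter $$\lambda ^c_j$$ in turn, which is what makes the scheme "space-alternating".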

The SA-EM algorithm can be run for a fixed, prescribed number of iterations or stopped when some criterion is fulfilled, e.g. that the difference between successive estimates is smaller than a given threshold.

### Remark 1

While bearing similarities to them, the SA-EM scheme introduced in this paper does not strictly belong to the class of algorithms described in [14]. Nevertheless, it is not hard to prove that the estimates $${{\hat{\Lambda }}}_i$$ generated by the SA-EM scheme have non-decreasing likelihoods, i.e. $$p(\varvec{x}_{0:n}|{{\hat{\Lambda }}}_{i+1}) \ge p(\varvec{x}_{0:n}|{{\hat{\Lambda }}}_i)$$ for every $$i \ge 1$$. This is the same property enjoyed by the standard EM method and the generalised algorithms of [14].

### 3.4 Estimation of the number of layers L

When the number of dynamical layers, L, in the cDN-ARMS(L) model is unknown we can also use the proposed SA-EM algorithm to estimate it. In particular, assume that $$L \in \{c^-, \ldots , c^+\}$$, i.e. the number of layers L of the model is at least $$c^-$$ and at most $$c^+$$, so that its estimation can be restricted to a finite set of values. We can run I iterations of the SA-EM algorithm for each $$L\in \{c^-, \ldots , c^+\}$$ in order to obtain approximate likelihoods

\begin{aligned} p(\varvec{x}_{0:T}|L) \approx p(\varvec{x}_{0:T}|{{\hat{\Lambda }}}_I) =: \ell _T(L), \quad L=c^-, \ldots , c^+, \end{aligned}

where $${\hat{\Lambda }}_I$$ is the set of parameter estimates after I iterations (note that the number of elements in this set increases with L). The likelihood $$p(\varvec{x}_{0:T}|{{\hat{\Lambda }}}_I)$$ above can be computed exactly as a by-product of the forward-backward algorithm run in the expectation step, with a computational cost $$\mathcal {O}(T)$$ (see [15] for details).
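For reference, the $$\mathcal {O}(T)$$ forward pass that accumulates the likelihood can be sketched as follows. This is a generic scaled forward recursion, not the authors' code; the per-step observation likelihoods `obs_lik` are assumed to have been evaluated beforehand from the fitted model.

```python
import numpy as np

def forward_loglik(M, p0, obs_lik):
    """Scaled forward recursion for a Markov-switching model (sketch).

    M:       (L, L) transition matrix, M[i, j] = P(l_n = j | l_{n-1} = i)
    p0:      (L,) initial distribution of l_0
    obs_lik: (T, L) conditional likelihoods p(x_n | l_n = l, x_{0:n-1})
    Returns log p(x_{0:T-1}), accumulated from the per-step normalisers.
    """
    alpha = p0 * obs_lik[0]
    c = alpha.sum()                  # normaliser = p(x_0)
    loglik = np.log(c)
    alpha /= c
    for n in range(1, obs_lik.shape[0]):
        alpha = (alpha @ M) * obs_lik[n]   # predict, then update
        c = alpha.sum()                    # p(x_n | x_{0:n-1})
        loglik += np.log(c)
        alpha /= c
    return loglik
```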

Choosing the value of $$L\in \{c^-,\ldots ,c^+\}$$ that maximises $$\ell _T(L)$$ typically leads to overestimation due to over-fitting. This is illustrated by the computer simulations of Sect. 4.2.1, where it is shown that increasing the number of layers beyond its true value can lead to an increase of the likelihood function $$\ell _T(L)$$ (see Table 1).

Following [38], we adopt a penalised likelihood estimator of the number of layers. In particular, we have run computer experiments with a penalisation of the likelihood of the form $$e^{-C_T |\Lambda |_L}$$, where $$C_T = \frac{1}{2}\log (T)$$ and $$|\Lambda |_L$$ denotes the number of parameters in the set $$\Lambda$$ when the model has L layers. This yields the penalised likelihood $${\tilde{\ell }}_T(L):= e^{-C_T |\Lambda |_L } p(\varvec{x}_{0:T}|{{\hat{\Lambda }}}_I)$$ and the penalised estimator

\begin{aligned} {\hat{L}} = \arg \max _{c^- \le L \le c^+} \left\{ \log {\tilde{\ell }}_T(L)\right\} = \arg \max _{c^- \le L \le c^+} \left\{ \log p(\varvec{x}_{0:T}|{{\hat{\Lambda }}}_I) - C_T |\Lambda |_L \right\} \end{aligned}
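The penalised selection rule above reduces to a few lines of code. The sketch below uses hypothetical names; `logliks` and `num_params` would be collected from the SA-EM runs for each candidate L.

```python
import math

def select_num_layers(logliks, num_params, T):
    """Penalised ML selection of the number of layers L (sketch).

    logliks:    dict L -> log p(x_{0:T} | Lambda_hat_I) after I SA-EM iterations
    num_params: dict L -> |Lambda|_L, the parameter count of the L-layer model
    Applies the penalty C_T = 0.5 * log(T) per parameter.
    """
    C_T = 0.5 * math.log(T)
    penalised = {L: ll - C_T * num_params[L] for L, ll in logliks.items()}
    return max(penalised, key=penalised.get)
```

For instance, with parameter counts growing in L (transition probabilities, static parameters and delays per layer), a modest likelihood gain from an extra layer is outweighed by the penalty, which is exactly the behaviour reported in Table 1.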

## 4 Computer simulations

### 4.1 El Niño–southern oscillation model

El Niño–southern oscillation (ENSO) is a recurring event belonging to a class of climatic phenomena called atmospheric oscillations. It originates from variations in wind intensities in the general atmospheric circulation. These variations cause the oscillation of the thermocline in the Pacific Ocean which, in turn, causes alternating high and low sea surface temperatures (SSTs) on both sides of the ocean. In particular, the El Niño phenomenon corresponds to an increase in SST in the eastern Pacific, which is associated with a strong increase in rainfall intensity and duration in Central and South America. The anomalies in the trade wind and the SST at the Pacific Ocean’s equator have been historically modelled as the solution of a DDE [37, 45, 46]. Several conceptual equations have also been proposed in the literature [18, 21] that incorporate stochastic terms or display chaotic dynamics.

For the computer experiments in this section we consider a nonlinear DDE based on the model of Ghil et al. [18], where we include a diffusion term to obtain the stochastic DDE in Itô form

\begin{aligned} \textsf{d}\mathcal {T}= \left[ b \cos ( 2\pi \omega t) - a \tanh \left( \kappa \mathcal {T}(t-\tau ) \right) \right] \textsf{d}t + \sigma \textsf{d}W, \end{aligned}
(19)

where t denotes continuous time (the time unit is 1 year), $$\mathcal {T}(t)$$ is the SST anomaly, $$\tau \in \mathbb {R}^+$$ is a time delay, a, b, $$\kappa$$ and $$\omega$$ are constants, $$W(t)$$ is a standard Wiener process, and $$\sigma >0$$ is a diffusion coefficient that determines the intensity of the stochastic perturbation.

Equation (19) can be integrated numerically using different schemes [7]. For simplicity, we apply an Euler-Maruyama scheme with constant time-step $$h>0$$, which yields

\begin{aligned} \mathcal {T}_n = \mathcal {T}_{n-1} + \left[ b \cos \left( 2\pi \omega h(n-1) \right) - a \tanh \left( \kappa \mathcal {T}_{n-1-D} \right) \right] h + \sqrt{h} \sigma u_n, \end{aligned}
(20)

where $$\mathcal {T}_n \approx \mathcal {T}(nh)$$ is an approximation of the SST anomaly process at $$t=nh$$, D is a discrete delay computed as $$D=\frac{\tau }{h}$$ (we assume that $$\tau$$ can be expressed as an integer multiple of h) and $$u_n$$ is an i.i.d. sequence of $$\mathcal {N}(0,1)$$ (standard Gaussian) r.v.’s. Note that the drift in Eq. (20) is periodic, hence $$\mathcal {T}_n$$ has bounded second order moment (provided that h and $$\sigma$$ are sufficiently small).

Model (19) and its discretised version (20) can be shown to yield sequences of temperatures which are (qualitatively and quantitatively) similar to the measurements of SST anomalies in the South Pacific ocean. However, these models are not accurate enough for reliable forecasting and, in particular, it has not been possible to use them to predict the large “spikes” in SST that characterise the El Niño phenomenon.

In an attempt at (a) extending the applicability of model (19) and (b) illustrating the proposed multi-layer modelling approach, we address the construction of multi-layer cDN-ARMS(L) models where each of the L layers corresponds to a difference equation of the form in (20) with its own set of static parameters $$\{ a[l],b[l],\kappa [l],\omega [l],\sigma [l]\}$$, $$l \in \{1, \ldots , L\}$$. Such a multi-layer model also requires an $$L\times L$$ transition matrix $$\varvec{M}$$ that describes the Markov switching mechanism.

### 4.2 DN-ARMS(L) model with integer delays

Let us construct a DN-ARMS(L) model of the form in Eq. (3) for SST anomaly time series. Publicly available SST data for ENSO is given in the form of monthly averaged SSTs and SST anomalies. For this reason, we adopt $$h=\frac{1}{12}$$ as a time step in the Euler scheme (20). This time step is known and common to all L layers of the DN-ARMS(L) model. Additionally, we define:

• A Markov chain $$\{l_n\}_{n\ge 0}$$ taking values in $$\mathcal {L}=\{1, \ldots , L\}$$ with an $$L \times L$$ transition matrix $$\varvec{M}$$.

• A set of delays $$\tau [1], \ldots , \tau [L]$$, one per layer. We assume that these delays are integer multiples of $$h=\frac{1}{12}$$; hence, they turn into integer delays $$D[1]=\frac{\tau [1]}{h}, \ldots , D[L]=\frac{\tau [L]}{h} \in \mathbb {N}$$.

• Parameters $$\{ a[l], b[l], \kappa [l], \omega [l], \sigma [l] \}$$ for each $$l \in \mathcal {L}$$; hence, the parameter vector $$\boldsymbol{\alpha } = \left( a[1], b[1], \kappa [1], \omega [1], \sigma [1], \ldots , a[L], b[L], \kappa [L], \omega [L], \sigma [L] \right) ^\top$$ is $$5L \times 1$$ (i.e. $$k=5L$$ in the general model (3)).

The assumption $$D[l]=\frac{\tau [l]}{h} \in \mathbb {N}$$ for every l may be a mild one when h can be chosen to be sufficiently small; however, it is hardly realistic for a time step of one month (and we drop it in Sect. 4.3).

The resulting DN-ARMS(L) model of the ENSO time series can be compactly written as

\begin{aligned} l_n\sim & {} P(l_n|l_{n-1}) = M_{l_{n-1},l_n}, \nonumber \\ x_n= & {} x_{n-1} + h\left[ b[l_n] \cos \left( 2\pi \omega [l_n] h(n-1) \right) - a[l_n] \tanh \left( \kappa [l_n] x_{n-D[l_n]} \right) \right] \nonumber \\{} & {} + \sqrt{h} \sigma [l_n] u_n, \quad n \ge 0, \end{aligned}
(21)

where $$x_n$$ is the SST anomaly at time $$t=nh$$ and $$\{u_n\}_{n\ge 0}$$ is an i.i.d. $$\mathcal {N}(0,1)$$ sequence. We assume $$x_n \sim \mathcal {N}(0,1)$$ for all $$n \le 0$$.
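To make the generative mechanism of model (21) concrete, the following sketch simulates it (hypothetical function and variable names, not the authors' code; the history buffer implements the Gaussian initialisation of the pre-sample values).

```python
import numpy as np

def simulate_dn_arms(M, a, b, kappa, omega, sigma, D, T, h=1/12, seed=0):
    """Simulate the DN-ARMS(L) ENSO model of Eq. (21) (sketch).

    M is the L x L transition matrix; a, b, kappa, omega, sigma and the
    integer delays D are length-L arrays, one entry per layer.
    """
    rng = np.random.default_rng(seed)
    L = M.shape[0]
    Dmax = int(max(D))
    x = np.empty(T + Dmax)
    x[:Dmax] = rng.standard_normal(Dmax)   # pre-sample history x_n ~ N(0,1)
    l = rng.integers(L)                    # initial layer
    for n in range(T):
        l = rng.choice(L, p=M[l])          # Markov switching step
        m = n + Dmax                       # buffer index of x_n
        drift = (b[l] * np.cos(2 * np.pi * omega[l] * h * (n - 1))
                 - a[l] * np.tanh(kappa[l] * x[m - D[l]]))
        x[m] = x[m - 1] + h * drift + np.sqrt(h) * sigma[l] * rng.standard_normal()
    return x[Dmax:]
```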

#### 4.2.1 Estimation of the number of layers L

In the first set of computer experiments, we assess the estimation of the number of layers L using the penalised maximum likelihood method described in Sect. 3.4. We simulate two datasets $$x_{0:T}^{(2)}$$ and $$x_{0:T}^{(3)}$$ with true values $$L_o=2$$ and $$L_o=3$$, respectively, and $$T=1,000$$. Then, for each $$L \in \{2,3,4\}$$ we run the SA-EM algorithm of Sect. 3.3. Since EM algorithms converge only locally [30], for each $$L_o$$ and each L we run the SA-EM scheme $$R=50$$ times, with the same dataset but different, independently generated initial parameter estimates $${\hat{\Lambda }}^{(r)}_{0}, r = 1,\ldots ,50$$. Recall that the SA-EM algorithm yields the likelihood as a by-product (see Sect. 3.4). We denote the likelihood of the model with L layers in the r-th simulation as $$\ell _{T}^{(r)}(L)$$ and then take the maximum likelihood (ML) over the R independent simulations, i.e. $$\ell _T(L) = \max _{1\le r \le R} \ell _{T}^{(r)}(L)$$.

The true transition matrices are

\begin{aligned} \varvec{M}= \left[ \begin{array}{cc} 0.6 &{}0.4\\ 0.3 &{}0.7\\ \end{array} \right] \quad \text {and} \quad \varvec{M}= \left[ \begin{array}{ccc} 0.5 &{} 0.3 &{} 0.2\\ 0.2 &{} 0.3 &{} 0.5\\ 0.2 &{} 0.6 &{} 0.2 \end{array} \right] \end{aligned}

for $$L_o=2$$ and $$L_o=3$$, respectively. The remaining parameters are

\begin{aligned} \begin{array}{lll} a[1:2] = (10,1)^\top , &{}\kappa [1:2] = (3,1)^\top , &{} b[1:2] =(10,1)^\top , \\ w[1:2] = (1/12,1/3)^\top , &{}\sigma [1:2] = (0.3,0.1)^\top , &{} D[1:2] = (5,15)^\top , \end{array} \end{aligned}

for $$L_o=2$$ and, for $$L_o=3$$,

\begin{aligned} \begin{array}{lll} a[1:3] = (10,1,2)^\top , &{}\kappa [1:3] = (3,2,1)^\top , &{} b[1:3] =(5,1,3)^\top , \\ w[1:3] = (1/12,1/3,1/5)^\top , &{}\sigma [1:3] = (0.4,0.2,0.1)^\top , &{} D[1:3] = (5,10,18)^\top . \end{array} \end{aligned}

Table 1 shows the log-likelihoods and the penalised log-likelihoods obtained for the models with $$L = 2,3,4,$$ when the true model generating the data has $$L_o=2$$ layers (left) and $$L_o=3$$ layers (right). It is seen that higher likelihoods ($$\ell _T(L)$$) are obtained as L is increased, even when $$L>L_o$$. This is due to over-fitting of the model parameters. When a simple penalisation is included ($$\tilde{\ell }_T(L)$$), the correct number of dynamical layers is detected in both experiments.

#### 4.2.2 Parameter estimation

In this section, we study the accuracy of the SA-EM parameter estimation algorithm for the model with $$L_o=2$$ layers described in Sect. 4.2.1. We aim at estimating the transition matrix $$\varvec{M}$$ as well as the integer delays $$D[1:2] = \left( D[1], D[2] \right) ^\top$$ and the model parameters $$\boldsymbol{\alpha } = \left( a[1:2],b[1:2],\kappa [1:2], w[1:2], \sigma [1:2] \right) ^\top$$. We assess the accuracy of the estimators of real parameters in terms of normalised errors. Assume we run R independent simulations, all with the same true parameters (but independently generated observations $$x_{0:T}$$). Then, the normalised estimation errors for the transition matrix $$\varvec{M}$$ are

\begin{aligned} e_{\varvec{M}}(r):= \Vert {\hat{\varvec{M}}}^{(r)}-{\varvec{M}}\Vert _{F} / \Vert \varvec{M}\Vert _{F}, \quad r=1, \ldots , R, \end{aligned}

where $$\Vert \cdot \Vert _{F}$$ denotes the Frobenius norm and $${\hat{\varvec{M}}}^{(r)}$$ represents the estimate of $$\varvec{M}$$ in the r-th independent simulation. For the parameters in $$\boldsymbol{\alpha }$$, the normalised errors are computed as

\begin{aligned} e_{\alpha _i}(r) = |{\hat{\alpha }}_i^{(r)}-\alpha _i| / |\alpha _i| \quad r=1, \ldots , R, \end{aligned}

where $$\alpha _i$$ denotes the i-th entry of vector $$\boldsymbol{\alpha }$$ and $${\hat{\alpha }}_i^{(r)}$$ is its estimate in the r-th independent simulation.

Since the delays D[1 : 2] are integers, calculating a normalised Euclidean norm does not provide a meaningful characterisation of performance. Instead, we assess the estimation algorithm by computing the frequency of correct detections, i.e. if $${\hat{D}}[l]^{(r)}$$ is the estimate of the delay D[l] in the r-th independent simulation, for $$r=1, \ldots , R$$, then the frequency of (correct) detections is $$F_D:= \sum _{r=1}^R \delta \left[ \sum _{l=1}^{L} |{\hat{D}}[l]^{(r)} - D[l]| \right]$$, where $$\delta [\cdot ]$$ is the Kronecker delta function.
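The detection frequency $$F_D$$ is straightforward to compute; a minimal sketch (with hypothetical names) is:

```python
def detection_frequency(D_hat, D_true):
    """Number of runs in which every delay is detected exactly (sketch).

    D_hat:  list of per-run delay estimates, each a sequence of length L
    D_true: the true delays (D[1], ..., D[L])
    Implements F_D = sum_r delta[ sum_l |D_hat[l] - D[l]| ], where the
    Kronecker delta counts a run only when all absolute errors vanish.
    """
    return sum(
        1 for est in D_hat
        if sum(abs(e - d) for e, d in zip(est, D_true)) == 0
    )
```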

The computer experiment consists of the following steps:

i)

Generate $$R=100$$ independent realisations of the time series model with $$L_o=2$$ described in Sect. 4.2.1. Each signal consists of $$T=1,000$$ data points. A sample signal is displayed in Fig. 1a.

ii)

For each independent realisation, extract four subsequences containing the first 250, 500, 750 and 1,000 data points, respectively.

iii)

Generate an initial condition for each realisation, $${\hat{\Lambda }}_0^{(r)}$$, $$r=1,..., R$$. Apply the SA-EM algorithm to each subsequence of each realisation and obtain parameter estimates.

iv)

For each $$r=1,..., R$$ and each subsequence, compute the normalised errors for the transition matrix ($$e_{\varvec{M}}(r)$$) and each real parameter ($$e_{\alpha _i}(r)$$), as well as the detection frequency $$F_D$$ for the integer delays.

The purpose of this setup is to demonstrate the effectiveness of the SA-EM algorithm, study the existence of local maxima of the likelihood and illustrate the improvement in the accuracy of the parameter estimates as the length of the observed series increases.

Figure 1b shows a bar diagram with the absolute frequency of correct detection, $$F_D$$, for each data sequence length. Since there are $$R=100$$ simulations (for each length), the maximum value of $$F_D$$ is $$R=100$$. We see that, as the length of the data sequence increases, the value of $$F_D$$ improves as well. When the number of data points in the sequence is $$T=1,000$$ we obtain $$F_D \approx 95$$, i.e. D[1] and D[2] are both detected correctly in $$\approx 95\%$$ of the simulations. The delays become mismatched in the simulation runs where the SA-EM algorithm converges to a local maximum of the likelihood.

Figures 2 and 3 show box plots of the normalised errors for the remaining parameters. For each parameter and each length of the data sequence, the horizontal line in each box indicates the median of the errors, the box extends between the 0.25 and 0.75 quantiles of the empirical distribution, the whiskers extend to the complete distribution and the circles are outliers (i.e. points located above the upper quartile by 1.5 times the interquartile range). We observe how the median error decreases when more data points are available. As with the delays, outliers are due to simulations where the SA-EM algorithm converges to a local maximum of the likelihood that differs significantly from its global maximum. These simulations indicate that, in any practical application, the SA-EM algorithm should be run with multiple initialisations (even for a single dataset). One can then select the parameter estimates from the run that attained the highest log-likelihood (which is computed by the SA-EM algorithm as a by-product of the forward-backward procedure).

### 4.3 cDN-ARMS(L) model with non-integer delays

The assumption of integer delays, i.e. that $$\tau [l]$$, $$l=1, \ldots , L$$, are all integer multiples of the one-month time step $$h=\frac{1}{12}$$, is unrealistic. In this section, we assume that $$D[l] = \frac{\tau [l]}{h} \in (1,+\infty )$$ and construct a cDN-ARMS(L) model of the form in Eq. (4) for the ENSO time series. To be specific, the SA-EM algorithm is implemented with the model

\begin{aligned} l_n\sim & {} P(l_n|l_{n-1}) = M_{l_{n-1},l_n}, \nonumber \\ x_n= & {} x_{n-1} + h\left[ b[l_n] \cos \left( 2\pi \omega [l_n] h(n-1) \right) - a[l_n] \tanh \left( \kappa [l_n] {\tilde{x}}_{n-D[l_n]} \right) \right] \nonumber \\{} & {} + \sqrt{h} \sigma [l_n] u_n, \quad n \ge 0, \end{aligned}
(22)

where $${\tilde{x}}_{n-D[l_n]}$$ is computed by linear interpolation as shown in Eq. (5). The transition matrix $$\varvec{M}$$ and the parameters $$\boldsymbol{\alpha }$$ are defined in the same way as in Sect. 4.2, and $$x_n \sim \mathcal {N}(0,1)$$ for all $$n<0$$.
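Although Eq. (5) is not reproduced in this section, a plausible sketch of the linear interpolation used to obtain $${\tilde{x}}_{n-D}$$ for a real delay D is the following (hypothetical helper, assuming a buffer of past samples indexed by integer time).

```python
import math

def delayed_value(x, n, D):
    """Linearly interpolated delayed sample x~_{n-D} for a real delay D.

    For non-integer D, the value at real time n - D is interpolated
    between the two neighbouring integer-time samples.
    """
    t = n - D                      # real-valued target time
    k = math.floor(t)
    frac = t - k                   # interpolation weight in [0, 1)
    if frac == 0.0:
        return x[k]                # integer delay: no interpolation needed
    return (1.0 - frac) * x[k] + frac * x[k + 1]
```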

#### 4.3.1 Generation of synthetic time series

The interpolated signal $${{\tilde{x}}}_{n-D[l]}$$ in (22) is an approximation that we impose to incorporate a real delay into a discrete-time model. In order to put this approximation to the test, we generate time series data using stochastic DDEs of the form in Eq. (19) that we integrate with an Euler-Maruyama scheme on a finer grid, namely, with a time step of the form $$h=\frac{1}{12\,m}$$, where m is a positive integer.

In particular, we let again $$L_o=2$$ and generate an auxiliary data sequence from the model

\begin{aligned} y_i= & {} y_{i-1} + \frac{1}{12m}\left[ b[l_i] \cos \left( 2\pi \omega [l_i] \frac{i-1}{12m} \right) - a[l_i] \tanh \left( \kappa [l_i] y_{i-\mathcal {D}[l_i]} \right) \right] \\{} & {} + \sqrt{\frac{1}{12m}} \sigma [l_i] u_i, \quad i \ge 0, \end{aligned}

where the $$\mathcal {D}[l]$$’s are positive integer delays, $$\{u_i\}_{i\ge 0}$$ is a standard Gaussian i.i.d. sequence of noise variables, $$y_i \sim \mathcal {N}(0,1)$$ for all $$i<0$$, $$l_i \sim P(l_i|l_{i-1})=M_{l_{i-1},l_i}$$ when i is an integer multiple of m and $$l_i=l_{i-1}$$ when i is not an integer multiple of m (i.e. the index $$l_i$$ can only change every m time steps).

The actual dataset used for the computer simulations is then obtained by subsampling $$\{y_i\}_{i\ge 0}$$ by a factor m, namely,

\begin{aligned} x_n = y_{nm}, \quad \text {for} \quad n = 0, 1, \ldots \end{aligned}

If $$m=2$$, the delays $$\mathcal {D}[l]$$, which are integers in the discrete-time scale of $$\{y_i\}_{i\ge 0}$$, become rational in the discrete-time scale of $$\{x_n\}_{n\ge 0}$$, with possible values of the form $$D[l] \in \left\{ r, r \pm \frac{1}{2} \right\}$$, where $$r \in \mathbb {Z}^+$$. For general $$m \in \mathbb {Z}^+$$, the delays in the subsampled time scale of the series $$\{ x_n \}_{n\ge 0}$$ are of the form $$D[l] \in \left\{ r, r + \frac{1}{m}, \ldots , r+\frac{m-1}{m} \right\}$$, with $$r\in \mathbb {Z}^+$$. In this way, we generate a data sequence $$\{x_n\}_{n\ge 0}$$ that depends on non-integer delays and does not rely on interpolation or any other approximation based on the observed data.
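The subsampling step and the resulting fractional delays can be sketched as follows (illustrative values; `calD` plays the role of an integer fine-grid delay $$\mathcal {D}[l]$$).

```python
import numpy as np

def subsample(y, m):
    """Subsample the fine-grid series {y_i} by a factor m: x_n = y_{n m}."""
    return np.asarray(y)[::m]

# An integer delay of 7 fine-grid steps becomes D = 7/2 = 3.5 in the
# subsampled (monthly) time scale when m = 2, i.e. a non-integer delay.
m, calD = 2, 7
D = calD / m
```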

#### 4.3.2 Parameter estimation

We conduct a set of computer experiments similar to those in Sect. 4.2.2 but using data sequences generated by the procedure in Sect. 4.3.1 to account for non-integer delays. The estimation errors for $$\varvec{M}$$ and the parameters in vector $$\boldsymbol{\alpha }$$ are computed in the same way as in Sect. 4.2.2. The estimates of the delays for these experiments are real numbers, hence we compute normalised errors of the form $$e_{D[l]}(r) = |D[l]-{\hat{D}}[l]^{(r)}|/|D[l]|$$, where D[l] is the true value of the delay and $${\hat{D}}[l]^{(r)}$$ is the estimate in the r-th simulation run.

The procedure for the computer experiments is the same as in Sect. 4.2.2, with $$R=100$$ independently generated data sequences of length $$T=1,000$$, each of them split to obtain subsequences of length 250, 500, 750 and 1,000. The values of the true parameters $$\boldsymbol{\alpha }$$ and transition matrix $$\varvec{M}$$ used to generate the data are the same as in the model with $$L_o=2$$ in Sect. 4.2.1. The auxiliary sequence $$\{y_i\}_{i\ge 0}$$ is generated with a time step $$\frac{1}{12m}$$, with $$m=2$$. The true non-integer delays are $$D[1]=3.5$$ and $$D[2]=9.5$$.

Figures 4 and 5 show the box plots of the normalised errors for all parameters. We note that the errors are grouped per parameter across the two layers (similar to Figs. 2 and 3). We observe that the median errors decrease consistently as the length of the data sequence increases, except for the real delays D[1:2]. In this case we see that the normalised errors are small and their variance reduces with increasing data length, but the median error remains approximately constant. This numerical result indicates that the estimators of D[1] and D[2] present a bias due to the mismatch between the model used by the SA-EM algorithm and the model used to generate the data.

## 5 Experimental results

### 5.1 Data and models

After validating the performance of the SA-EM inference algorithm with synthetic data in Sect. 4, we now tackle the fitting of cDN-ARMS(L) models using real ENSO data. The dataset consists of four time series of monthly SST anomalies, each corresponding to a different zone of the Pacific Ocean and each consisting of $$T=1,848$$ monthly observations, from January 1870 to December 2023. The series are labelled ENSO 1+2, ENSO 3, ENSO 3+4 and ENSO 4. Figure 6 shows the evolution of the SST anomalies in the 12-year period between January 2012 and December 2023. The ‘peaks’ in December 2015 and December 2023 are the most recent El Niño events.

We have applied the SA-EM algorithm on each one of these datasets in order to fit different models:

• cDN-ARMS(L) models with $$L=2, 3, 4$$, and

• a Markov switching model with $$L=4$$ layers where each layer is a standard linear AR(4) system (as described, e.g. in [10, Chapter 4]), labelled ARMS(4,4) in the figures and tables of this section.

The specific form of the Markov switching system based on linear AR processes is briefly described in Appendix C. All four models are fitted using the SA-EM algorithm. For the cDN-ARMS(L) models the algorithm is applied in the same way as in Sect. 4.3. For the models with linear AR(4) layers, the algorithm simplifies considerably as $$\Lambda _c = \emptyset$$, which implies that there is no need for the ARS optimisation scheme. The overall estimation procedure becomes very similar to the one in [15] (except that the models in [15] are of order 1). Note that the models are fitted for each specific dataset.

In addition to the Markov switching models, we have also used the ENSO datasets to train four deep learning-based schemes for time series forecasting. In particular,

• a gated recurrent unit (GRU) neural network (NN) [23],

• a multi-layer perceptron (MLP) NN [33],

• long short-term memory (LSTM) NN [6], and

• the DeepAR model [42].

We have trained each model with a learning rate of 0.001, 50 epochs and a time window of the 12 previous months. Otherwise, we have selected the combinations of layers and neurons that yielded the best numerical results in our computer experiments. All models are implemented in Python. The GRU, MLP, and LSTM NNs are built using the Keras library (https://keras.io), while the DeepAR model is implemented using the GluonTS library (https://ts.gluon.ai/stable/index.html).

We have used the data spanning from January 1870 to December 1991 ($$T_{train} = 1,464$$ observations) in order to train the deep learning models and fit the Markov switching schemes. The remaining data, from January 1992 to December 2023 ($$T_{test} = 384$$ observations) has served as the test set to assess the prediction performance.

### 5.2 Autocorrelation function

Figure 7 shows the empirical autocorrelation functions computed using the four ENSO datasets and synthetic data generated with the cDN-ARMS(4), ARMS(4,4) and DeepAR models fitted to each one of the four ENSO zones. Note that the GRU, MLP and LSTM models are non-generative (they perform a deterministic transformation of their inputs), hence they are not displayed in this comparison.

The three generative models perform relatively well for autocorrelation lags up to 6–7 months in zones 1+2, 3 and 3+4. For ENSO 4, cDN-ARMS(4) and DeepAR underestimate the autocorrelation, while ARMS(4,4) overestimates it. After 6–7 months the discrepancies are larger, with cDN-ARMS(4) and ARMS(4,4) typically overestimating the autocorrelation. These curves are an average of the empirical autocorrelations of 2,000 independently generated series for each model.

### 5.3 Forecasting

The forecasting task involves the prediction of the SST anomalies with a lead time of k months. For a given lead time k and a given algorithm, we evaluate the prediction root-mean square error (RMSE) and the Pearson correlation coefficient (PCC), given by

\begin{aligned} \text {RMSE}_k = \sqrt{\frac{1}{T_{test}} \sum _{n=1}^{T_{test}} (x_n - \hat{x}_n^k)^2} \end{aligned}

and

\begin{aligned} \text {PCC}_k = \frac{{\sum _{n=1}^{T_{test}} (x_n - \bar{x})(\hat{x}_n^k - \bar{\hat{x}}^k)}}{{\sqrt{\sum _{n=1}^{T_{test}} (x_n - \bar{x})^2} \sqrt{\sum _{n=1}^{T_{test}} (\hat{x}_n^k - \bar{\hat{x}}^k)^2}}}, \end{aligned}

where $$x_{1:T_{test}}$$ is the signal from the first to the last month of the test period, $$\hat{x}_n^k$$ is the forecast with lead time of k months, $$\bar{x}$$ is the mean of $$x_n$$ and $$\bar{\hat{x}}^k$$ is the mean of $$\hat{x}_n^k$$. Better performance is achieved for smaller RMSE$$_k$$ and larger PCC$$_k$$. The predictions for cDN-ARMS(4) and ARMS(4,4) are computed using a standard particle filter (whose state is the Markov chain $$l_n \sim \varvec{M}$$) [12] with $$N=500$$ particles, while for DeepAR the prediction $${\hat{x}}_n^k$$ is the mean over $$N=500$$ generated sequences.
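Both figures of merit follow directly from their definitions; a straightforward sketch (with hypothetical function names) is:

```python
import numpy as np

def rmse(x, x_hat):
    """Root-mean-square error between signal and k-month-ahead forecasts."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.sqrt(np.mean((x - x_hat) ** 2))

def pcc(x, x_hat):
    """Pearson correlation coefficient between signal and forecasts."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    xc, hc = x - x.mean(), x_hat - x_hat.mean()     # centred series
    return np.sum(xc * hc) / (np.sqrt(np.sum(xc ** 2)) * np.sqrt(np.sum(hc ** 2)))
```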

Tables 2, 3, 4 and 5 display the RMSEs and PCCs of each model for lead times $$k=1, 3, 6$$ and 9 months for ENSO 1+2, ENSO 3, ENSO 3+4 and ENSO 4, respectively. For lead times of $$k=1$$ and $$k=3$$ months, cDN-ARMS(4) achieves the best or second-best performance across all zones both for RMSE and PCC. Its relative performance deteriorates for larger lead times (6 and 9 months). While both its RMSE and PCC values remain close to those of the other methods, GRU is the best-performing method for ENSO 1+2, DeepAR attains the best performance for ENSO 3 and ENSO 3+4, and LSTM and DeepAR are best for ENSO 4.

Finally, Fig. 8 provides a graphical illustration of the forecast for ENSO 3+4 from January 2012 to December 2023, with lead times of 1, 3, 6 and 9 months for Fig. 8a–d, respectively. The figures show the true SST, the mean forecast and the $$\pm 3\sigma$$ interval around the mean forecast. The predictions are accurate, with relatively small uncertainty, and capture well the December 2015 and December 2023 El Niño events. For the 6-month and 9-month forecasts the uncertainty is larger. The December 2023 El Niño peak is still within the shaded area, but the actual December 2015 peak occurs earlier than predicted.

## 6 Conclusions

We have introduced a class of nonlinear autoregressive Markov-switching time series models where each dynamical layer (or sub-model) may have a different, possibly non-integer delay. This class includes a broad collection of systems that result from the discretisation of stochastic DDEs where the characteristic delays are not a priori known. Such models are common in geophysics.

The proposed family of models does not necessarily admit an asymptotic-regime analysis similar to the classical results of [47] for first-order Markov-switching systems. Instead, we have identified a certain stability property of the distinct dynamic regimes in the switching model that guarantees that the signals generated by the proposed model have bounded moments up to a given order. We have also introduced numerical methods, based on a space-alternating EM procedure, to detect the number of dynamical layers in the model and to compute ML estimators of any unknown parameters, including the multiple, possibly non-integer delays. The performance of these inference methods has been tested on nonlinear autoregressive Markov-switching models that combine up to four dynamical layers, each one of them originating from a DDE typically used to represent anomalies in the sea surface temperature of (certain regions of) the Pacific Ocean due to the ENSO phenomenon. Real-world ENSO data are recorded as a monthly series, yet the delays that characterise the phenomenon are not known a priori and there is no physical reason for them to be integer multiples of a month. Our computer simulations show that it is possible to detect the number of layers and estimate the parameters of the models using relatively short series of observations. The cDN-ARMS(L) models fitted using real ENSO data can also be used to forecast strong positive anomalies in the sea surface temperature (El Niño phenomenon). We show a comparison of the predictive capability of the proposed cDN-ARMS(L) scheme with several deep-learning-based time series forecasting models.

## Availability of data and materials

The datasets analysed during the current study are available in the Climate Data Guide repository of the US National Center for Atmospheric Research (NCAR). URL: https://climatedataguide.ucar.edu/climate-data.

## Notes

1. We denote $${\hat{\Lambda }}_{i+1} = {\hat{\Lambda }}_{M,i+1} \cup {\hat{\Lambda }}_{D,i+1} \cup {\hat{\Lambda }}_{\alpha ,i+1}$$.

## References

1. P. Ailliot, V. Monbet, Markov-switching autoregressive models for wind time series. Environ. Modell. Softw. 30, 92–101 (2012)

2. A. Aknouche, C. Francq, Stationarity and ergodicity of Markov switching positive conditional mean models. J. Time Ser. Anal. 43(3), 436–459 (2022)

3. M.J. Appel, R. Labarre, D. Radulovic, On accelerated random search. SIAM J. Optim. 14(3), 708–730 (2003)

4. A. Bellen, M. Zennaro, Numerical methods for delay differential equations (Oxford University Press, Oxford, 2013)

5. A. Bibi, A. Ghezal, On the Markov-switching bilinear processes: stationarity, higher-order moments and β-mixing. Stoch. Int. J. Probab. Stoch. Process. 87(6), 919–945 (2015)

6. C. Broni-Bedaiko, F.A. Katsriku, T. Unemi, M. Atsumi, J.-D. Abdulai, N. Shinomiya, E. Owusu, El niño-southern oscillation forecasting using complex networks analysis of lstm neural networks. Artif. Life Robot. 24, 445–451 (2019)

7. E. Buckwar, Introduction to the numerical analysis of stochastic delay differential equations. J. Comput. Appl. Math. 125(1–2), 297–307 (2000)

8. R. Casarin, D. Sartore, M. Tronzano, A Bayesian Markov-switching correlation model for contagion analysis on exchange rate markets. J. Bus. Econ. Stat. 36(1), 101–114 (2018)

9. M. Cavicchioli, Determining the number of regimes in Markov switching VAR and VMA models. J. Time Ser. Anal. 35(2), 173–186 (2014)

10. J.D. Cryer, K.-S. Chan, Time series analysis: with applications in R, vol. 2 (Springer, Berlin, 2008)

11. A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B 39(1), 1–38 (1977)

12. P.M. Djurić, J.H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M.F. Bugallo, J. Míguez, Particle filtering. IEEE Signal Process. Mag. 20(5), 19–38 (2003)

13. M.F. Fernandez, Modelling Volatility with Markov-Switching GARCH Models. PhD thesis, The University of Liverpool, United Kingdom (2022)

14. J.A. Fessler, A.O. Hero, Space-alternating generalized expectation-maximization algorithm. IEEE Trans. Signal Process. 42(10), 2664–2677 (1994)

15. J. Franke, Markov switching time series models, in Time Series Analysis: Methods and Applications, ed. by T. Subba Rao, S. Subba Rao, C.R. Rao (Elsevier, 2012), pp. 99–122

16. C. Fritsche, A. Klein, F. Gustafsson, Bayesian Cramer-Rao bound for mobile terminal tracking in mixed LOS/NLOS environments. IEEE Wirel. Commun. Lett. 2(3), 335–338 (2013)

17. X. Fu, Y. Jia, J. Du, F. Yu, New interacting multiple model algorithms for the tracking of the manoeuvring target. IET Control Theory Appl. 4(10), 2184–2194 (2010)

18. M. Ghil, I. Zaliapin, S. Thompson, A delay differential model of ENSO variability: parametric instability and the distribution of extremes. Nonlinear Process. Geophys. 15(3), 417–433 (2008)

19. P. Guérin, M. Marcellino, Markov-switching MIDAS models. J. Bus. Econ. Stat. 31(1), 45–56 (2013)

20. S. Höcht, K.H. Ng, J. Wiesent, R. Zagst, Fit for leverage-modelling of hedge fund returns in view of risk management purposes. Int. J. Contemp. Math. Sci. 4(19), 895–916 (2009)

21. F.F. Jin, L. Lin, A. Timmermann, J. Zhao, Ensemble-mean dynamics of the ENSO recharge oscillator under state-dependent stochastic forcing. Geophys. Res. Lett. (2007). https://doi.org/10.1029/2006GL027372

22. A. Keane, B. Krauskopf, C.M. Postlethwaite, Climate models with delay differential equations. Chaos Interdiscip. J. Nonlinear Sci. 27(11), 114309 (2017)

23. J. Kim, M. Kwon, S.-D. Kim, J.-S. Kug, J.-G. Ryu, J. Kim, Spatiotemporal neural network with attention mechanism for El Niño forecasts. Sci. Rep. 12(1), 7204 (2022)

24. L. Lacasa, I.P. Mariño, J. Miguez, V. Nicosia, É. Roldán, A. Lisica, S.W. Grill, J. Gómez-Gardeñes, Multiplex decomposition of non-Markovian dynamics and the hidden layer reconstruction problem. Phys. Rev. X 8(3), 031038 (2018)

25. R. Le Goff Latimier, E. Le Bouedec, V. Monbet, Markov switching autoregressive modeling of wind power forecast errors. Electric Power Syst. Res. 189, 106641 (2020)

26. B.G. Leroux, Maximum-likelihood estimation for hidden Markov models. Stoch. Process. Appl. 40(1), 127–143 (1992)

27. R.J. MacKay, Estimating the order of a hidden Markov model. Can. J. Stat. 30(4), 573–589 (2002)

28. C. Magnant, A. Giremus, E. Grivel, L. Ratton, B. Joseph, Dirichlet-process-mixture-based Bayesian nonparametric method for Markov switching process estimation, in 2015 23rd European Signal Processing Conference (EUSIPCO) (IEEE, 2015), pp. 1969–1973

29. I.P. Mariño, J. Míguez, Monte Carlo method for multiparameter estimation in coupled chaotic systems. Phys. Rev. E 76(5), 057203 (2007)

30. G.J. McLachlan, D. Peel, Finite Mixture Models (John Wiley & Sons, New York, 2000)

31. J.G. McLachlan, T. Krishnan, The EM algorithm and extensions (John Wiley & Sons, New York, 2007)

32. V. Monbet, P. Ailliot, Sparse vector Markov switching autoregressive models. Application to multivariate time series of temperature. Comput. Stat. Data Anal. 108, 40–51 (2017)

33. G.Y. Muluye, P. Coulibaly, Seasonal reservoir inflow forecasting with low-frequency climatic indices: a comparison of data-driven methods. Hydrol. Sci. J. 52(3), 508–522 (2007)

34. O. Cappé, E. Moulines, T. Rydén, Inference in Hidden Markov Models (Springer, New York, 2005)

35. B. Øksendal, Stochastic differential equations, 6th edn. (Springer, Cham, 2007)

36. S.W. Phoong, S.Y. Phoong, S.L. Khek, Systematic literature review with bibliometric analysis on Markov switching model: Methods and applications. SAGE Open 12(2), 21582440221093064 (2022)

37. J. Picaut, F. Masia, Y. du Penhoat, An advective-reflective conceptual model for the oscillatory nature of the ENSO. Science 277(5326), 663–666 (1997)

38. Z. Psaradakis, N. Spagnolo, On the determination of the number of regimes in Markov-switching autoregressive models. J. Time Ser. Anal. 24(2), 237–252 (2003)

39. G.W. Pulford, A survey of manoeuvring target tracking methods. arXiv:1503.07828 (2015)

40. C.P. Robert, G. Casella, Monte Carlo Statistical Methods (Springer, Cham, 2004)

41. T. Rydén, Estimating the order of hidden Markov models. Stat. J. Theor. Appl. Stat. 26(4), 345–354 (1995)

42. D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 36(3), 1181–1191 (2020)

43. C.A. Sims, D.F. Waggoner, T. Zha, Methods for inference in large multiple-equation Markov-switching models. J. Econ. 146(2), 255–274 (2008)

44. R. Stelzer, On Markov-switching ARMA processes-stationarity, existence of moments, and geometric ergodicity. Economet. Theor. 25(1), 43–62 (2009)

45. M.J. Suarez, P.S. Schopf, A delayed action oscillator for ENSO. J. Atmos. Sci. 45(21), 3283–3287 (1988)

46. C. Wang, A review of ENSO theories. Natl. Sci. Rev. 5(6), 813–825 (2018)

47. J.-F. Yao, J.-G. Attali, On stability of nonlinear AR processes with Markov switching. Adv. Appl. Probab. 32(2), 394–407 (2000)

48. J. Zhang, R.A. Stine, Autocovariance structure of Markov regime switching models and model selection. J. Time Ser. Anal. 22(1), 107–124 (2001)

49. F. Zheng, Learning and smoothing in switching Markov models with copulas. PhD thesis, École Centrale de Lyon (2017)

## Funding

This work has been partially supported by the Office of Naval Research (award N00014-22-1-2647) and Spain’s Agencia Estatal de Investigación (ref. PID2021-125159NB-I00 TYCHE) funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”.

## Author information


### Contributions

José A. Martínez-Ordóñez has contributed to the design of the work, the analysis of data, the creation of new software, and the draft and revision of the manuscript. Javier López-Santiago has contributed to the acquisition, analysis and interpretation of data, and the revision of the manuscript. Joaquín Míguez has contributed to the conception and design of the work, the analysis of data, and the draft and revision of the manuscript.

### Corresponding author

Correspondence to Joaquín Miguez.

## Ethics declarations

### Competing interests

The authors declare that there are no conflicts of interest related to this research.

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### 1.1 Proof of Proposition 1

Let $$C_0^q:= \mathbb {E}_{\varvec{x}_0}\left[ \Vert \varvec{x}_0 \Vert ^q \right] < \infty$$ and let

\begin{aligned} c_\infty ^{l,q}:= \sup _{n\ge 0} \mathbb {E}_{\varvec{x}_n^{(l)}}\left[ \Vert \varvec{x}_n^{(l)} \Vert ^q \right] , \quad l=1, \ldots , L. \end{aligned}

Since the random processes $$\{ \varvec{x}_n^{(l)} \}_{n\ge 0}$$ are q-stable, the constants $$c_\infty ^{l,q}$$ are finite. Also, since $$L<\infty$$, it follows that $$C_\infty ^q:= \max _{l\in \{1,...,L\}} c_\infty ^{l,q}<\infty$$ (because it is the maximum of a finite number of finite constants).

From the Markov chain $$\{l_n\}_{n \ge 0}$$ we construct a discrete, bivariate random sequence $$\{s_k,r_k\}_{k\ge 1}$$ in the following way:

• $$(s_1,r_1)=(v,m)\in \{1,..., L\} \times \mathbb {N}$$ if, and only if, $$l_{1:m}=v$$ and $$l_{m+1} \ne v$$;

• $$(s_k,r_k)=(v,m) \in \{1,..., L\} \times \mathbb {N}$$ if, and only if, $$l_n=v$$ for $$n=\sum _{i=1}^{k-1} r_i + 1, \ldots , \sum _{i=1}^k r_i$$ and $$l_n \ne v$$ for $$n = 1+\sum _{i=1}^k r_i$$.

Intuitively, $$\{s_k,r_k\}_{k \ge 1}$$ describes how the sequence $$\{l_n\}_{n \ge 0}$$ can be split into subsequences of identical consecutive layers (leaving the initial layer, $$l_0$$, aside). For example, if $$L=3$$ and $$l_{0:7} = \{2, 1, 1, 1, 3, 3, 2, 2\}$$ then $$(s_1,r_1)=(1,3)$$, $$(s_2,r_2)=(3,2)$$ and $$(s_3,r_3)=(2,2)$$. The random sequence $$\{s_k,r_k\}_{k\ge 1}$$ is measurable w.r.t. the $$\sigma$$-algebra generated by the Markov chain $$\{l_n\}_{n\ge 0}$$. In particular, if we choose a specific realisation of the Markov chain, denoted $$\{l_n^*\}_{n\ge 0}$$, then we can determine the corresponding realisation of $$\{s_k,r_k\}_{k\ge 1}$$, which we denote $$\{s_k^*,r_k^*\}_{k\ge 1}$$.
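This construction is simply a run-length encoding of the layer sequence. A minimal Python sketch (the function name `runs` is ours, chosen for illustration):

```python
def runs(l):
    """Run-length encode the layer sequence l = (l_0, l_1, ..., l_n).

    The initial layer l[0] is left aside, as in the text; the output is
    the realisation of (s_k, r_k), k >= 1, as a list of (value, length) pairs.
    """
    pairs = []
    for v in l[1:]:
        if pairs and pairs[-1][0] == v:
            pairs[-1] = (v, pairs[-1][1] + 1)   # extend the current run
        else:
            pairs.append((v, 1))                # start a new run
    return pairs

# The example from the text: L = 3 and l_{0:7} = {2, 1, 1, 1, 3, 3, 2, 2}
print(runs([2, 1, 1, 1, 3, 3, 2, 2]))  # [(1, 3), (3, 2), (2, 2)]
```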

We now use the fixed (but otherwise arbitrary) sequence $$\{s_k^*,r_k^*\}_{k\ge 1}$$ to prove, by induction in the index k, that $$\sup _{n \ge 0} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l_{0:\infty }^* \right] \le C_\infty ^q \vee C_0^q$$, where $$\mathbb {E}_{\varvec{x}_n}[\cdot | l^*_{0:\infty }]$$ denotes expectation w.r.t. $$\varvec{x}_n$$ conditional on the realisation of the Markov chain $$\{l_m=l_m^*\}_{m\ge 0}$$.

For $$k=1$$, we have $$l^*_n=s_1^*$$ for $$n=1, \ldots , r_1^*$$, hence the r.v.’s $$\varvec{x}_{1:r_1^*}$$ are generated with the dynamics of layer $$s_1^*$$. If we construct a patched version $$\{\bar{\varvec{x}}_n^{(s_1^*)}\}_{n\ge 0}$$ of $$\{\varvec{x}_n^{(s_1^*)}\}_{n\ge 0}$$, with the patch consisting of the initial state (i.e. $$n_0=0$$ and $$\bar{\varvec{x}}_0^{(s_1^*)} = \varvec{x}_0$$ in distribution), then it is apparent that $$\varvec{x}_{0:r_1^*} = \bar{\varvec{x}}_{0:r_1^*}^{(s_1^*)}$$ in distribution. As a consequence, it follows that

\begin{aligned} \sup _{0 \le n \le r_1^*} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l^*_{0:\infty } \right] \le \sup _{n\ge 0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_1^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_1^*)} \Vert ^q | l^*_{0:\infty } \right] . \end{aligned}
(A1)

Moreover, the sequence $$\{\varvec{x}_n^{(s_1^*)}\}_{n\ge 0}$$ is q-stable by assumption, which implies that

\begin{aligned} \sup _{n\ge 0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_1^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_1^*)} \Vert ^q | l^*_{0:\infty } \right] \le c_\infty ^{s_1^*,q} \vee \sup _{n\le n_0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_1^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_1^*)} \Vert ^q | l^*_{0:\infty } \right] = C_\infty ^q \vee C_0^q, \end{aligned}
(A2)

where we have used that $$c_\infty ^{l,q} \le C_\infty ^q$$ for every $$l=1,..., L$$, together with $$n_0=0$$. Substituting (A2) into (A1) yields

\begin{aligned} \sup _{0 \le n \le r_1^*} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l^*_{0:\infty } \right] \le C_\infty ^q \vee C_0^q \end{aligned}

and completes the base case. Note that the constants $$C_\infty ^q$$ and $$C_0^q$$ are independent of the choice of $$l^*_{0:\infty }$$.

For the induction step, assume that

\begin{aligned} \sup _{0 \le n \le \sum _{i=1}^{k-1} r_i^*} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l^*_{0:\infty } \right] \le C_\infty ^q \vee C_0^q. \end{aligned}
(A3)

By construction, $$l_n^*=s_k^*$$ for $$n=1+\sum _{i=1}^{k-1} r_i^*, \ldots , \sum _{i=1}^k r_i^*$$. If we choose the patch $$\bar{\varvec{x}}_n^{(s_k^*)} = \varvec{x}_n$$ (with equality in distribution) for $$n=0, \ldots , \sum _{i=1}^{k-1} r_i^*$$ and let $$\bar{\varvec{x}}_n^{(s_k^*)}$$ be generated by the $$s_k^*$$-th layer for $$n > \sum _{i=1}^{k-1} r_i^*$$ then the sequence $$\{ \bar{\varvec{x}}_n^{(s_k^*)} \}_{n\ge 0}$$ is a patched version of $$\{ \varvec{x}_n^{(s_k^*)} \}_{n\ge 0}$$, with $$n_0 = \sum _{i=1}^{k-1} r_i^*$$, that satisfies the inequality

\begin{aligned} \sup _{0 \le n \le \sum _{i=1}^k r_i^*} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l^*_{0:\infty } \right] \le \sup _{n\ge 0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_k^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_k^*)} \Vert ^q | l^*_{0:\infty } \right] \end{aligned}
(A4)

because $$\varvec{x}_n = \bar{\varvec{x}}_n^{(s_k^*)}$$, in distribution, for $$n = 0, \ldots , \sum _{i=1}^k r_i^*$$. Moreover, since $$\{\varvec{x}_n^{(s_k^*)}\}_{n\ge 0}$$ is q-stable, its patched version satisfies the inequality

\begin{aligned} \sup _{n\ge 0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_k^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_k^*)} \Vert ^q | l^*_{0:\infty } \right] \le c_\infty ^{s_k^*,q} \vee \sup _{0 \le n \le n_0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_k^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_k^*)} \Vert ^q | l^*_{0:\infty } \right] , \end{aligned}
(A5)

where $$n_0 = \sum _{i=1}^{k-1} r_i^*$$. Again, $$\varvec{x}_n = \bar{\varvec{x}}_n^{(s_k^*)}$$, in distribution, for $$n = 0, \ldots , \sum _{i=1}^{k} r_i^*$$, hence we can substitute the induction hypothesis (A3) into (A5) to obtain

\begin{aligned} \sup _{n\ge 0} \mathbb {E}_{\bar{\varvec{x}}_n^{(s_k^*)}}\left[ \Vert \bar{\varvec{x}}_n^{(s_k^*)} \Vert ^q | l^*_{0:\infty } \right] \le c_\infty ^{s_k^*,q} \vee \left( C_\infty ^q \vee C_0^q\right) \le C_\infty ^q \vee C_0^q, \end{aligned}
(A6)

where the last inequality follows trivially because $$c_\infty ^{l,q} \le C_\infty ^q$$ for every $$l\in \{1, \ldots , L\}$$. Substituting (A6) back into (A4) yields

\begin{aligned} \sup _{0 \le n \le \sum _{i=1}^k r_i^*} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l^*_{0:\infty } \right] \le C_\infty ^q \vee C_0^q \end{aligned}
(A7)

and completes the induction step. Therefore, (A7) holds for every k and it follows that

\begin{aligned} \sup _{n \ge 0} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l^*_{0:\infty } \right] \le C_\infty ^q \vee C_0^q. \end{aligned}
(A8)

Next, we note that

\begin{aligned} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q \right] = \mathbb {E}_{l_{0:\infty }}\left[ \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l_{0:\infty } \right] \right] \le \mathbb {E}_{l_{0:\infty }}\left[ \sup _{n\ge 0} \mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q | l_{0:\infty } \right] \right] \end{aligned}
(A9)

and, since the constants $$C_\infty ^q$$ and $$C_0^q$$ in (A8) hold for arbitrary $$l^*_{0:\infty }$$, it follows that $$\mathbb {E}_{\varvec{x}_n}\left[ \Vert \varvec{x}_n \Vert ^q \right] \le C_\infty ^q \vee C_0^q$$ for every n, i.e.

\begin{aligned} \sup _{n\ge 0} \mathbb {E}\left[ \Vert \varvec{x}_n \Vert ^q \right] < \infty . \end{aligned}
(A10)

It remains to show that the patched versions of $$\varvec{x}_n$$ also have bounded moments of order q. To that end, simply choose an arbitrary patch $$\bar{\varvec{x}}_{0:m}$$ and let $$\{\bar{\varvec{x}}_n\}_{n>m}$$ be generated by the cDN-ARMS(L) model, choose an arbitrary sequence $$l_{m:\infty }^*$$ and construct the sequence $$\{s_k^*,r_k^*\}_{k\ge 1}$$ that starts with $$s_1^*=l_{m+1}^*$$. Then, the same induction argument as above, starting with the base case at $$n=m+1$$, shows that $$\sup _{n\ge 0} \mathbb {E}\left[ \Vert \bar{\varvec{x}}_n \Vert ^q \right] \le C_\infty ^q \vee C_m^q$$, where $$C_m^q:= \sup _{0 \le n \le m} \mathbb {E}\left[ \Vert \bar{\varvec{x}}_n \Vert ^q \right] < \infty$$. $$\square$$

### 1.2 Accelerated random search algorithm

Let $$f: \mathcal {R}\subseteq \mathbb {R}^d \rightarrow [0,\infty )$$ be a real objective function and consider the optimisation problem $$\hat{\boldsymbol{\beta }} \in \arg \max _{\boldsymbol{\beta } \in \mathcal {R}} f(\boldsymbol{\beta })$$. The accelerated random search algorithm of [3] (see also [29] for some extensions) is an iterative method for global optimisation that performs a Monte Carlo search on a sequence of balls of varying radius. The algorithm can be outlined as follows:

• Initialisation: choose a minimum radius $$r_{\min }>0$$, a maximum radius $$r_{\max }>r_{\min }$$, a contraction factor $$c>1$$ and an (arbitrary) initial solution $$\boldsymbol{\beta }_0$$. Set $$r_1 = r_{\max }$$.

• Iteration: denote the solution at the $$(n-1)$$-th iteration as $$\boldsymbol{\beta }_{n-1}$$ and let $$B_n:= \left\{ \boldsymbol{\beta } \in \mathcal {R}: \Vert \boldsymbol{\beta } - \boldsymbol{\beta }_{n-1} \Vert < r_n \right\}$$. To compute a new solution $$\boldsymbol{\beta }_n$$, take the following steps:

1. Draw $$\tilde{\boldsymbol{\beta }}$$ from the uniform probability distribution on $$B_n$$.

2. If $$f(\tilde{\boldsymbol{\beta }}) > f(\boldsymbol{\beta }_{n-1})$$ then set $$\boldsymbol{\beta }_n = \tilde{\boldsymbol{\beta }}$$ and $$r_{n+1} = r_{\max }$$. Otherwise, set $$\boldsymbol{\beta }_n = \boldsymbol{\beta }_{n-1}$$ and $$r_{n+1} = \frac{r_{n}}{c}$$.

3. If $$r_{n+1}<r_{\min }$$, then set $$r_{n+1}=r_{\max }$$.

The algorithm can be iterated a fixed number of times or stopped when $$\boldsymbol{\beta }_n = \boldsymbol{\beta }_{n-1} = \ldots = \boldsymbol{\beta }_{n-r}$$ for a prescribed, sufficiently large integer $$r>0$$.
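As an illustration, the steps above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function name, the fixed iteration budget and the ball-sampling routine (direction times a $$U^{1/d}$$-scaled radius, which is uniform on a Euclidean ball) are our choices.

```python
import numpy as np

def ars_maximise(f, beta0, r_min, r_max, c, n_iter, rng=None):
    """Accelerated random search for maximising f, following the outline above."""
    rng = np.random.default_rng() if rng is None else rng
    beta = np.asarray(beta0, dtype=float)
    f_beta, r, d = f(beta), r_max, beta.size
    for _ in range(n_iter):
        # draw a candidate uniformly from the ball of radius r centred at beta
        u = rng.standard_normal(d)
        cand = beta + u * (r * rng.uniform() ** (1.0 / d) / np.linalg.norm(u))
        f_cand = f(cand)
        if f_cand > f_beta:        # improvement: accept and reset the radius
            beta, f_beta, r = cand, f_cand, r_max
        else:                      # no improvement: contract the search ball
            r /= c
            if r < r_min:          # below the minimum radius: restart wide
                r = r_max
    return beta, f_beta
```

For example, maximising $$f(\boldsymbol{\beta }) = -\Vert \boldsymbol{\beta } \Vert ^2$$ from $$\boldsymbol{\beta }_0 = (2, -1.5)$$ drives the iterate towards the origin.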

### 1.3 Markov switching models with linear AR(K) layers

In Sect. 5 we compare the numerical performance of the proposed cDN-ARMS(L) model with Markov switching models, also with L layers, where each layer is a linear AR(K) process. These models can be explicitly described as

\begin{aligned} \varvec{x}_n = \sum _{k=1}^K a_k[l_n] \varvec{x}_{n-k} + \sigma [l_n] \varvec{z}_n, \end{aligned}

where $$\{l_n\}_{n\ge 0}$$ is a Markov chain taking values in the set $$\{1, \ldots , L\}$$ and characterised by the $$L \times L$$ transition matrix $$\varvec{M}$$, $$\{\varvec{z}_n\}_{n\ge 0}$$ is a sequence of i.i.d. standard Gaussian r.v.’s and $$\{ a_1[l], \ldots , a_K[l], \sigma [l]: l=1, \ldots , L \}$$ is the set of $$L(K+1)$$ model parameters (i.e. $$K+1$$ parameters per layer).

In our computer simulations we fit these parameters using the SA-EM algorithm of Sect. 3.3, with the peculiarities that (a) there are no unknown delays and (b) the set $$\Lambda _c$$ is empty, i.e. the update of the parameters $$\{ a_1[l], \ldots , a_K[l], \sigma [l]: l=1, \ldots , L \}$$ at each maximisation step can be done exactly.
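A short Python sketch of this scalar benchmark model may help fix ideas. It is our own illustration, not the paper's code; in particular, we assume the row convention $$M[i,j] = \text {Prob}(l_n = j | l_{n-1} = i)$$ for the transition matrix, a zero initial history and a uniformly drawn initial layer.

```python
import numpy as np

def simulate_ms_ar(a, sigma, M, n, rng=None):
    """Simulate a Markov switching model with L linear AR(K) layers.

    a: (L, K) array of AR coefficients a_k[l]; sigma: (L,) noise scales;
    M: (L, L) row-stochastic transition matrix (orientation assumed, see text).
    Returns the signal x_1, ..., x_n and the layer sequence l_1, ..., l_n.
    """
    rng = np.random.default_rng() if rng is None else rng
    L, K = a.shape
    x = np.zeros(n + K)                 # x[:K] holds the (zero) initial history
    l = int(rng.integers(L))            # initial layer, drawn uniformly
    layers = np.empty(n, dtype=int)
    for t in range(n):
        l = int(rng.choice(L, p=M[l]))  # one step of the Markov chain
        layers[t] = l
        past = x[t:K + t][::-1]         # x_{n-1}, ..., x_{n-K}
        x[K + t] = a[l] @ past + sigma[l] * rng.standard_normal()
    return x[K:], layers
```

With $$L=2$$ and $$K=2$$, for instance, `simulate_ms_ar(np.array([[0.5, 0.0], [0.0, 0.3]]), np.array([1.0, 0.5]), np.array([[0.9, 0.1], [0.2, 0.8]]), 500)` produces a trajectory that alternates between the two AR(2) regimes.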

## Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Martínez-Ordoñez, J.A., López-Santiago, J. & Miguez, J. Maximum likelihood inference for a class of discrete-time Markov switching time series models with multiple delays. EURASIP J. Adv. Signal Process. 2024, 74 (2024). https://doi.org/10.1186/s13634-024-01166-8
