In this part, we show the details of the TLGRU and the \({E}^{2}\)GAN-based method for imputing missing values in multivariate time series. The overall architecture of the proposed method is shown in Fig. 1. We replace the GRUI with the TLGRU in the architecture of \({E}^{2}\)GAN and achieve a new state-of-the-art imputation accuracy.

The imputation method consists of a generator (G) and a discriminator (D). The generator is composed of an auto-encoder and recurrent neural networks. We adopt a compress-and-reconstruct strategy: the encoder compresses the incomplete input time series \(X\) into a low-dimensional vector \(z\), and the decoder then reconstructs a complete time series \({X}^{^{\prime}}\) from \(z\). The discriminator tries to distinguish the original incomplete time series \(X\) from the fake but complete sample \({X}^{^{\prime}}\). After adversarial training, the generator produces complete time series \({X}^{^{\prime}}\) that can fool the discriminator, and the discriminator can best determine the authenticity of \({X}^{^{\prime}}\).

A traditional GAN is difficult to train stably over a long period and is prone to mode collapse. Arjovsky et al. [24] proposed the Wasserstein GAN (WGAN), which improves learning stability and alleviates the problem of mode collapse. In our method, we use WGAN instead of the traditional GAN. The loss functions of WGAN are as follows.

$${L}_{G}={\mathbb{E}}_{z\sim {P}_{g}}[-D(G(z))]$$

(3)

$${L}_{D}={\mathbb{E}}_{z\sim {P}_{g}}[D(G(z))]-{\mathbb{E}}_{x\sim {P}_{r}}[D(x)]$$

(4)
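As a minimal sketch of Eqs. (3) and (4), the expectations can be estimated by batch means of the discriminator scores. The function names and the example scores below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import fmean

def wgan_generator_loss(d_fake):
    """Eq. (3): batch estimate of E[-D(G(z))]."""
    return -fmean(d_fake)

def wgan_discriminator_loss(d_fake, d_real):
    """Eq. (4): batch estimate of E[D(G(z))] - E[D(x)]."""
    return fmean(d_fake) - fmean(d_real)

# Hypothetical discriminator scores for a batch of three samples.
d_real = [0.9, 0.8, 0.7]   # scores D(x) on real samples
d_fake = [0.2, 0.1, 0.3]   # scores D(G(z)) on generated samples

loss_g = wgan_generator_loss(d_fake)
loss_d = wgan_discriminator_loss(d_fake, d_real)
```

Minimizing `loss_d` pushes the scores of real samples up and those of generated samples down, while minimizing `loss_g` pushes the scores of generated samples up.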

### 4.1 Time and location gated recurrent unit

Multivariate time series have certain latent relationships and regularity between adjacent observations in the same dimension and between observations in different dimensions. When imputing missing values in multivariate time series, we should consider not only the relationships between missing values and observations of the same dimension, but also the relationships between missing values and observations of different dimensions. Most current imputation methods do not consider these relationships between observations and are therefore difficult to use for imputing missing values.

In multivariate time series, because of missing values, two adjacent observations have non-fixed time intervals and location intervals. If the data in one dimension are missing continuously for a long time, the time interval and location interval between two valid observations in that dimension will be larger than in other dimensions. The GRUI decays the memory of the Gated Recurrent Unit (GRU) by introducing the time interval matrix. The TLGRU improves on the GRUI: we consider both the time interval and the location interval information between observations, and we introduce a decay vector \(\beta\) to decay the memory of the GRU. The update function of \(\beta\) is as follows.

$${\beta }_{i}=\frac{1}{{e}^{\mathrm{max}(0,{w}_{\beta }({\alpha }_{t}{\delta }_{{t}_{i}}+{\alpha }_{l}{\delta }_{{l}_{i}})+{b}_{\beta })}}$$

(5)

where \({\delta }_{t}\) and \({\delta }_{l}\) are the time interval matrix and the location interval matrix, and the hyper-parameters \({\alpha }_{t}\) and \({\alpha }_{l}\) are the time weight and the location weight. The values of \({\alpha }_{t}\) and \({\alpha }_{l}\) are determined by random initialization combined with a large number of experiments. \({w}_{\beta }\) and \({b}_{\beta }\) are training parameters. The formulation of \(\beta\) guarantees that the value of \(\beta\) decreases as the time interval \({\delta }_{t}\) and the location interval \({\delta }_{l}\) increase: the smaller \({\delta }_{t}\) and \({\delta }_{l}\), the larger \(\beta\). This formulation also ensures that \(\beta \in (0, 1]\).
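The behaviour of Eq. (5) can be checked with a small sketch. It assumes, purely for illustration, that \({w}_{\beta }\) and \({b}_{\beta }\) are scalars applied element-wise; the function name `decay` and all numeric values are hypothetical.

```python
import math

def decay(delta_t, delta_l, alpha_t, alpha_l, w_beta, b_beta):
    """Eq. (5), element-wise: beta = 1 / exp(max(0, w*(a_t*dt + a_l*dl) + b))."""
    return [
        math.exp(-max(0.0, w_beta * (alpha_t * dt + alpha_l * dl) + b_beta))
        for dt, dl in zip(delta_t, delta_l)
    ]

# Illustrative intervals: no gap, a short gap, and a long gap.
delta_t = [0.0, 1.0, 5.0]   # time intervals
delta_l = [0.0, 1.0, 5.0]   # location intervals

# Assumed scalar parameters, not values from the paper.
beta = decay(delta_t, delta_l, alpha_t=0.5, alpha_l=0.5, w_beta=1.0, b_beta=0.0)
```

With these assumed values, `beta` starts at 1 for a zero interval and shrinks toward 0 as the intervals grow, so the hidden state is forgotten more strongly after long gaps.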

The TLGRU of the proposed method is shown in Fig. 2. The decay vector \(\beta\) is a core part of the TLGRU. Before each TLGRU iteration, we update the hidden state \({h}_{i-1}\) with the decay vector \(\beta\). The update functions of the TLGRU are as follows.

$${h}_{i-1}^{^{\prime}}={\beta }_{i}\odot {h}_{i-1}$$

(6)

$${z}_{i}=\sigma ({W}_{z}[{h}_{i-1}^{^{\prime}},{x}_{i}]+{b}_{z})$$

(7)

$${r}_{i}=\sigma ({W}_{r}[{h}_{i-1}^{^{\prime}},{x}_{i}]+{b}_{r})$$

(8)

$${\widetilde{h}}_{i}=tanh({W}_{\widetilde{h}}[{r}_{i}\odot {h}_{i-1}^{^{\prime}},{x}_{i}]+{b}_{\widetilde{h}})$$

(9)

$${h}_{i}=(1-{z}_{i})\odot {h}_{i-1}^{^{\prime}}+{z}_{i}\odot {\widetilde{h}}_{i}$$

(10)

where \(z\) is the update gate, \(r\) is the reset gate, \(\widetilde{h}\) is the candidate hidden state, \(h\) is the current hidden state, \(\sigma\) is the sigmoid activation function, \(\odot\) is element-wise multiplication, and \({W}_{z}\), \({b}_{z}\), \({W}_{r}\), \({b}_{r}\), \({W}_{\widetilde{h}}\), \({b}_{\widetilde{h}}\) are training parameters.
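A single TLGRU update following Eqs. (6)-(10) might be sketched as below. This is a plain-Python illustration rather than the authors' implementation: the helper names, the weight layout (one row per hidden unit over the concatenation \([h, x]\)), and the random initialization are all assumptions.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, v, b):
    """Return W v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def tlgru_step(h_prev, x, beta, params):
    """One TLGRU update; h_prev, x, and beta are plain lists of floats."""
    Wz, bz, Wr, br, Wh, bh = params
    h_dec = [b * h for b, h in zip(beta, h_prev)]               # Eq. (6)
    z = [sigmoid(v) for v in affine(Wz, h_dec + x, bz)]         # Eq. (7)
    r = [sigmoid(v) for v in affine(Wr, h_dec + x, br)]         # Eq. (8)
    rh = [ri * hi for ri, hi in zip(r, h_dec)]
    h_cand = [math.tanh(v) for v in affine(Wh, rh + x, bh)]     # Eq. (9)
    return [(1.0 - zi) * hi + zi * ci                           # Eq. (10)
            for zi, hi, ci in zip(z, h_dec, h_cand)]

def init_params(hidden, inputs, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(hidden + inputs)]
                for _ in range(hidden)]
    return (mat(), [0.0] * hidden, mat(), [0.0] * hidden, mat(), [0.0] * hidden)

rng = random.Random(0)
params = init_params(hidden=3, inputs=2, rng=rng)
h = tlgru_step(h_prev=[0.5, -0.5, 0.1], x=[1.0, 0.0],
               beta=[1.0, 0.5, 0.1], params=params)
```

The only difference from a standard GRU step in this sketch is the first line of `tlgru_step`, where the previous hidden state is scaled by \(\beta\) before the gates are computed.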

### 4.2 Generator architecture

The generator of the proposed method is shown in Fig. 3. The generator is an auto-encoder based on the TLGRU cell, consisting of an encoder and a decoder. The encoder compresses the incomplete time series \(X\) into a low-dimensional vector \(z\), and the decoder reconstructs the complete time series \({X}^{^{\prime}}\) from \(z\). Unlike a traditional auto-encoder, we only add noise to corrupt the original samples rather than dropping out some values. The random noise \(\eta\) is sampled from a normal distribution \(\mathcal{N}\left(0, 0.01\right)\); this avoids the loss of data information that occurs in a traditional auto-encoder and reduces over-fitting to a certain extent. The update functions of the denoising auto-encoder are as follows.

$$z={Encoder}(X+\eta )$$

(11)

$${X}^{^{\prime}}={Decoder}(z)$$

(12)
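The input corruption \(X+\eta\) in Eq. (11) can be illustrated with a short sketch. It assumes that the second argument of \(\mathcal{N}\left(0, 0.01\right)\) is the variance, so the standard deviation is 0.1, and that noise is added to every entry; the text does not state either point, and the function name `add_noise` is hypothetical.

```python
import random

def add_noise(X, std, rng):
    """Return X + eta, with eta drawn i.i.d. from a zero-mean Gaussian."""
    return [[x + rng.gauss(0.0, std) for x in row] for row in X]

rng = random.Random(0)              # fixed seed so the sketch is reproducible
X = [[1.0, 0.0], [2.0, 3.0]]        # toy series; 0.0 stands in for a missing value

# Assumption: 0.01 is read as the variance, hence std = 0.1.
X_noisy = add_noise(X, std=0.1, rng=rng)
```

Unlike dropping values, this corruption keeps every observed entry close to its original value, which matches the motivation given above.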

Since both the encoder and the decoder use the TLGRU cell to process multivariate time series, we need to input the corresponding time interval matrices \({\delta }_{t}\), \({\delta }_{t}^{^{\prime}}\) and location interval matrices \({\delta }_{l}\), \({\delta }_{l}^{^{\prime}}\) in the process of multivariate time series compression and reconstruction. \({\delta }_{t}\) and \({\delta }_{l}\) represent the time interval and location interval of the original incomplete time series. \({\delta }_{t}^{^{\prime}}\) and \({\delta }_{l}^{^{\prime}}\) represent the time interval and location interval of the reconstructed complete time series.

Because the generator tries to produce a new sample \({X}^{^{\prime}}\) that is as similar as possible to \(X\), we add a squared error loss to the loss function of the generator. The loss function of the generator is as follows, where \(\lambda\) is a hyper-parameter that controls the relative weight of the discriminative loss and the squared error loss.

$${G}_{L}=-D({X}^{^{\prime}})+\lambda \| X\odot M-{X}^{^{\prime}}\odot M{\| }_{2}$$

(13)
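A minimal sketch of Eq. (13) is given below. It assumes that \(M\) is a binary mask with 1 for observed entries and 0 for missing ones, which is consistent with how \(M\) is used in Eq. (15), and that the norm is taken over all entries of the masked difference. The function name and the numeric values are hypothetical.

```python
import math

def masked_generator_loss(d_fake, X, X_rec, M, lam):
    """Eq. (13): -D(X') + lambda * || X*M - X'*M ||_2."""
    sq = sum(
        (m * x - m * xr) ** 2
        for row_x, row_r, row_m in zip(X, X_rec, M)
        for x, xr, m in zip(row_x, row_r, row_m)
    )
    return -d_fake + lam * math.sqrt(sq)

X = [[1.0, 0.0], [2.0, 3.0]]        # missing entry stored as 0.0
M = [[1, 0], [1, 1]]                # assumed mask: 1 = observed, 0 = missing
X_rec = [[1.0, 9.0], [2.0, 7.0]]    # hypothetical reconstruction X'

# d_fake is a hypothetical discriminator score D(X'); lam is an assumed weight.
loss = masked_generator_loss(d_fake=0.5, X=X, X_rec=X_rec, M=M, lam=10.0)
```

In this toy case the reconstruction differs from \(X\) at the missing position (0.0 versus 9.0), but the mask removes that entry, so only the observed mismatch (3.0 versus 7.0) contributes to the reconstruction term.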

First, we replace the missing values of \(X\) with zeros at the input stage of the TLGRU. Then we feed the TLGRU cell with the incomplete time series \(X\) and its interval matrices \({\delta }_{t}\) and \({\delta }_{l}\). After recurrent processing of the input time series, the last hidden state of the recurrent neural network flows to a fully connected layer. The output of this fully connected layer is the compressed low-dimensional vector \(z\).

Next, we take \(z\) as the input of another fully connected layer. Then we use the output of this layer as the first input of another TLGRU cell. The current output of this TLGRU cell is fed into the next iteration of the same TLGRU cell. At the final stage, we combine all the outputs of this TLGRU cell into the generated complete time series \({X}^{^{\prime}}\).

### 4.3 Discriminator architecture

The discriminator is composed of TLGRU cells and a fully connected layer. The task of the discriminator is to distinguish between the fake complete time series \({X}^{^{\prime}}\) and the true incomplete time series \(X\). The output of the discriminator is a probability that indicates the degree of authenticity. We try to find a set of parameters that produces a high probability when we feed the true incomplete time series \(X\) and a low probability when we feed the fake complete time series \({X}^{^{\prime}}\). The loss function of the discriminator is as follows.

$${D}_{L}=-D(X)+D({X}^{^{\prime}})$$

(14)

With the help of the TLGRU cell, the multivariate time series can be successfully handled. The last hidden state of the TLGRU cell is fed into one fully connected layer that outputs the probability \(p\) of the input being true. We also use the sigmoid function to ensure that \(p\in (0, 1)\).

### 4.4 Imputation

For each true incomplete time series \(X\), we map it into a low-dimensional vector \(z\) and reconstruct a fake complete time series \({X}^{^{\prime}}\) from \(z\), so that \({X}^{^{\prime}}\) is as close as possible to \(X\). We then use the corresponding values of \({X}^{^{\prime}}\) to fill in the missing values of \(X\). The imputation formula can be summarized as follows.

$${X}_{{imputed }}=M\odot X+(1-M)\odot {X}^{^{\prime}}$$

(15)
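Eq. (15) can be illustrated with a small sketch. It assumes that \(M\) is a binary mask (1 for observed, 0 for missing) and that missing entries of \(X\) hold a finite placeholder such as zero, as in the input stage described in Sect. 4.2. The function name `impute` and the toy values are hypothetical.

```python
def impute(X, X_rec, M):
    """Eq. (15): keep observed values of X, fill missing ones from X'."""
    return [
        [m * x + (1 - m) * xr for x, xr, m in zip(row_x, row_r, row_m)]
        for row_x, row_r, row_m in zip(X, X_rec, M)
    ]

X = [[1.0, 0.0], [2.0, 3.0]]        # missing entry stored as 0.0
M = [[1, 0], [1, 1]]                # assumed mask: 1 = observed, 0 = missing
X_rec = [[1.1, 9.0], [2.2, 7.0]]    # hypothetical generator output X'

X_imputed = impute(X, X_rec, M)     # [[1.0, 9.0], [2.0, 3.0]]
```

Observed values are never overwritten: even where \({X}^{^{\prime}}\) differs from \(X\) at an observed position, the imputed series keeps the original observation.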