TLGRU: time and location gated recurrent unit for multivariate time series imputation

Multivariate time series are widely used in industrial equipment monitoring and maintenance, health monitoring, weather forecasting, and other fields. Due to abnormal sensors, equipment failures, environmental interference, and human errors, the collected multivariate time series usually contain missing values. Missing values obscure the regularity of the data and seriously affect the further analysis and application of multivariate time series. Conventional imputation methods, such as statistical imputation and machine learning-based imputation, cannot learn the latent relationships in the data and are difficult to apply to missing values imputation in multivariate time series. This paper proposes a novel Time and Location Gated Recurrent Unit (TLGRU), which takes into account the non-fixed time intervals and location intervals in multivariate time series and effectively deals with missing values. We made the necessary modifications to the architecture of the end-to-end imputation model E²GAN, replacing the Gated Recurrent Unit for Imputation (GRUI) with TLGRU to make the generated fake sample closer to the original sample. Experiments on a public meteorologic dataset show that our method outperforms the baselines in imputation accuracy and achieves a new state-of-the-art result.

Among the related missing values imputation methods, simple approaches such as Mean, Median, Mode, and Last Observed are easy to apply, but they struggle to restore the real data attributes and their imputation effect is mediocre [1]. Regression Imputation [2] is prone to random errors, resulting in large fluctuations in the imputation effect. k-Nearest Neighbor (KNN) [3], Clustering [4], Expectation Maximization (EM) [5,6], and Multiple Imputation (MI) have high computational complexity and low efficiency, which makes it difficult to apply them to multivariate time series. Recurrent Neural Network (RNN) [7]-based imputation methods can learn the latent relationships and regularity of the time series and have been used to impute missing values.
Recently, Generative Adversarial Network (GAN) [8]-based imputation methods have gradually become a research hotspot and have achieved better results in the field of missing values imputation in multivariate time series. In order to learn the latent relationships between observations with non-fixed time intervals, Luo et al. [9] proposed a novel RNN cell called the Gated Recurrent Unit for Imputation (GRUI), which can take into account the non-fixed time intervals and fade the influence of past observations as determined by the time intervals. Based on GRUI, Luo et al. [10] further proposed an end-to-end GAN-based imputation model, E²GAN, which consists of a generator and a discriminator. After adversarial training of the generator and the discriminator, the generator can generate a complete time series that fits the distribution of the original dataset and is used to impute the missing values. E²GAN achieved better imputation accuracy; however, GRUI only considers the time interval information between two observations of a missing time series, ignoring the equally important location interval information.
The contribution of this paper is to make full use of the time interval and location interval information between observations of missing time series: building on GRUI, we propose a novel GRU cell called the Time and Location Gated Recurrent Unit (TLGRU). Experiments on a real meteorologic dataset show that our method achieves a new state-of-the-art imputation accuracy with time efficiency similar to GRUI.

Related works
The research on missing values imputation has received extensive attention, and various methods have been proposed. Among the statistics-based imputation methods, Kantardzic [11] tried to impute missing values with the mean value. Purwar et al. [12] used the mode to impute missing values. Amiri et al. [13] used the last observation to impute missing values in incomplete data. Statistics-based imputation methods do not consider the characteristics of the missing values; the imputation result is dominated by the observed values, and the imputation accuracy is poor.
Among the machine learning-based imputation methods, Hudak et al. [14] used the mean value of the k nearest neighbors to impute missing values. White et al. [15] proposed Multiple Imputation by Chained Equations (MICE), which imputes missing values using an iterative regression model. Hastie et al. [16] proposed an imputation method based on Matrix Factorization (MF), which treats the original dataset as a matrix, decomposes it into the product of two matrices using algorithms such as Principal Component Analysis (PCA), and finally imputes the missing values with the product. Ogbeide [5] proposed a Mode-Related Expectation Adaptive Maximization (MEAM) for obtaining better statistical inferences from multivariate data with missing values.
There are also many RNN-based imputation methods in the field of multivariate time series. Berglund et al. [7] proposed two probabilistic interpretations of bidirectional recurrent neural networks that can be used to reconstruct missing samples efficiently. Che et al. [18] proposed GRUD, which imputes missing values of a clinical dataset with a smoothing method; GRUD takes advantage of the last observed value and the mean value to represent missing patterns of incomplete time series. Cao et al. [19] proposed Bidirectional Recurrent Imputation for Time Series (BRITS), which directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption.
GAN-based imputation methods seek to generate new samples that obey the distribution of the training dataset; they have been used to impute missing values and have achieved high imputation accuracy. Yoon et al. [20] proposed Generative Adversarial Imputation Nets (GAIN), which uses a hint vector conditioned on what is actually observed to impute missing values. GAIN made tremendous advances in data imputation. Shang et al. [21] proposed a GAN-based missing values imputation algorithm for multimodal data, which can learn the common properties of multimodal data and impute missing values in certain modal data. Luo et al. [10,22] proposed E²GAN, which takes a compressing and reconstructing strategy to automatically learn internal representations of the time series and reconstruct the temporal data as faithfully as possible. E²GAN also improves imputation performance by obtaining a better feature representation of the samples, which contributes to better reconstructed samples. Zhang et al. [23] optimized E²GAN by using real data during the training of the generator to force the imputed values to be close to the real ones.

Problem formulation
Given a d-dimensional multivariate time series X observed at timestamps T = (t_0, t_1, ..., t_{n−1}) and locationstamps L = (l_0, l_1, ..., l_{n−1}), we denote X = (x_0, x_1, ..., x_{n−1}) ∈ R^{d×n}, where x_i is the ith observation of X and x_i^j is the jth dimension of x_i.
Suppose the d-dimensional time series X is incomplete. M ∈ R^{d×n} is a mask matrix taking values in {0, 1} that indicates whether the values of X exist or not: if x_i^j exists, then M_i^j = 1; otherwise M_i^j = 0. The following is an example of a 3-dimensional multivariate time series X and its corresponding M, T, and L, where "/" denotes a missing value. We define a matrix δ_t ∈ R^{d×n} that records the time interval between the current value and the last observed value. The following part shows the calculation and a calculated example of δ_t.
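The calculation itself did not survive extraction; following the interval definition used by GRUI [9], a plausible reconstruction in the notation above is:

```latex
(\delta_t)_i^j =
\begin{cases}
0 & i = 0,\\
t_i - t_{i-1} & M_{i-1}^j = 1,\ i > 0,\\
t_i - t_{i-1} + (\delta_t)_{i-1}^j & M_{i-1}^j = 0,\ i > 0.
\end{cases}
```

For instance, if dimension j is observed at t_0 = 0, missing at t_1 = 1 and t_2 = 3, and observed again at t_3 = 6, the rule accumulates across the missing run: (δ_t)_3^j = (6 − 3) + (3 − 1) + (1 − 0) = 6, the full gap since the last observation.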
We define a matrix δ_l ∈ R^{d×n} that records the location interval between the current value and the last observed value. The following part shows the calculation and a calculated example of δ_l.
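Both δ_t and δ_l follow the same accumulation rule, applied to the timestamps T and locationstamps L respectively. The sketch below illustrates this with a hypothetical helper `interval_matrix` (our illustration, not the paper's code), assuming the GRUI-style rule: the interval accumulates across missing entries and restarts after an observed one.

```python
import numpy as np

def interval_matrix(stamps, mask):
    """Compute the d x n interval matrix from a length-n array of stamps
    (timestamps or locationstamps) and a d x n {0, 1} mask matrix.

    If the previous value in a dimension was missing, the interval
    accumulates; otherwise it restarts from the latest stamp gap.
    """
    d, n = mask.shape
    delta = np.zeros((d, n))
    for i in range(1, n):
        gap = stamps[i] - stamps[i - 1]
        for j in range(d):
            if mask[j, i - 1] == 1:
                delta[j, i] = gap
            else:
                delta[j, i] = gap + delta[j, i - 1]
    return delta

# Toy 2-dimensional series: dimension 0 has a missing run at steps 1 and 2
T = np.array([0.0, 1.0, 3.0, 6.0])
M = np.array([[1, 0, 0, 1],
              [1, 1, 1, 1]])
dt = interval_matrix(T, M)
print(dt)  # row 0 accumulates across the missing run, row 1 restarts each step
```

Calling the same function with the locationstamps L in place of T yields δ_l.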

Approach
In this part, we present the details of the TLGRU and the E²GAN-based method for multivariate time series missing values imputation. The overall architecture of the proposed method is shown in Fig. 1. We replaced GRUI with TLGRU in the architecture of E²GAN and achieved a new state-of-the-art imputation accuracy.
The imputation method consists of a generator (G) and a discriminator (D). The generator is an auto-encoder built from recurrent neural networks. We take a compressing and reconstructing strategy: the encoder compresses the input incomplete time series X into a low-dimensional vector z, and the decoder uses z to reconstruct a complete time series X′. The discriminator tries to distinguish the original incomplete time series X from the fake but complete sample X′. After adversarial training, the generator generates a complete time series X′ that can fool the discriminator, and the discriminator can best determine the authenticity of X′. A traditional GAN is difficult to keep stable over long training runs and is prone to mode collapse. Arjovsky et al. [24] proposed the Wasserstein GAN (WGAN), which improves learning stability and avoids the problem of mode collapse. In our method, we use WGAN instead of GAN. The following are the loss functions of WGAN.
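The loss functions themselves did not survive extraction; the standard WGAN objectives [24], written in this paper's notation (a reconstruction, not the authors' exact equations), are:

```latex
L_D = -\mathbb{E}_{X \sim P_r}\left[D(X)\right] + \mathbb{E}_{X' \sim P_g}\left[D(X')\right],
\qquad
L_G = -\mathbb{E}_{X' \sim P_g}\left[D(X')\right],
```

where P_r is the distribution of the real incomplete series and P_g is the distribution of the generated complete series.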

Time and location gated recurrent unit
Multivariate time series have certain latent relationships and regularity between adjacent observations in the same dimension and between observations in different dimensions. When imputing missing values in multivariate time series, not only the relationships between missing values and observations of the same dimension, but also the relationships between missing values and observations of different dimensions should be considered. Most current missing values imputation methods lack consideration of the relationships between observations and are difficult to use for imputing missing values.
In multivariate time series, due to the existence of missing values, two adjacent observations have non-fixed time intervals and location intervals. If the data in one dimension is missing continuously for a long time, the time interval and location interval between two valid observations in that dimension will be larger than in other dimensions. GRUI decreases the memory of the Gated Recurrent Unit (GRU) by introducing the time interval matrix. TLGRU improves on GRUI: we consider both the time interval and the location interval information between observations, and we introduce a decay vector β to decrease the memory of the GRU. The following is the update function of β.
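The update function itself was lost in extraction; consistent with the description that follows (β shrinks as δ_t and δ_l grow, and β ∈ (0, 1]), a GRUI-style reconstruction is:

```latex
\beta_i = \frac{1}{\exp\!\left(\max\left(0,\; w_\beta \left(\alpha_t\, (\delta_t)_i + \alpha_l\, (\delta_l)_i\right) + b_\beta\right)\right)}
```

The max(0, ·) term keeps the exponent non-negative, so β never exceeds 1, and larger weighted intervals push β toward 0.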
where δ_t and δ_l are the time interval matrix and the location interval matrix, and the hyper-parameters α_t and α_l are the time weight and the location weight. The values of α_t and α_l are determined by random initialization combined with a large number of experiments. w_β and b_β are training parameters. The formulation of β guarantees that as the time interval matrix δ_t and the location interval matrix δ_l increase, the value of β decreases: the smaller δ_t and δ_l, the bigger β. This formulation also ensures that β ∈ (0, 1].
The TLGRU of the proposed method is shown in Fig. 2. The decay vector β is the core of the TLGRU. Before each TLGRU iteration, we update the hidden state h_{i−1} with the decay vector β. The following are the update functions of the TLGRU.
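The update functions did not survive extraction; following the standard GRU/GRUI formulation and the symbol list below, a reconstruction is:

```latex
\begin{aligned}
h'_{i-1} &= \beta_i \odot h_{i-1},\\
z_i &= \sigma\!\left(W_z\left[h'_{i-1}, x_i\right] + b_z\right),\\
r_i &= \sigma\!\left(W_r\left[h'_{i-1}, x_i\right] + b_r\right),\\
\tilde{h}_i &= \tanh\!\left(W_{\tilde{h}}\left[r_i \odot h'_{i-1}, x_i\right] + b_{\tilde{h}}\right),\\
h_i &= (1 - z_i) \odot h'_{i-1} + z_i \odot \tilde{h}_i.
\end{aligned}
```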
where z is the update gate, r is the reset gate, h̃ is the candidate hidden state, h is the current hidden state, σ is the sigmoid activation function, ⊙ is element-wise multiplication, and W_z, b_z, W_r, b_r, W_h̃, b_h̃ are training parameters.
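To make the mechanism concrete, here is a minimal numpy sketch of a single TLGRU step (our illustration, not the authors' code; the parameter shapes, the dictionary layout, and the example values of α_t and α_l are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tlgru_step(x, h_prev, dt, dl, p, alpha_t=0.5, alpha_l=0.5):
    """One TLGRU step. p holds the trainable parameters; dt and dl are
    the time and location interval vectors for this observation."""
    # Decay vector beta in (0, 1]: larger intervals give smaller beta
    beta = 1.0 / np.exp(np.maximum(0.0, p["W_b"] @ (alpha_t * dt + alpha_l * dl) + p["b_b"]))
    h = beta * h_prev                                  # fade the old memory
    u = np.concatenate([h, x])
    z = sigmoid(p["W_z"] @ u + p["b_z"])               # update gate
    r = sigmoid(p["W_r"] @ u + p["b_r"])               # reset gate
    c = np.tanh(p["W_h"] @ np.concatenate([r * h, x]) + p["b_h"])  # candidate state
    return (1.0 - z) * h + z * c                       # current hidden state

# Smoke run with random parameters (d = 3 input dims, k = 4 hidden units)
rng = np.random.default_rng(0)
d, k = 3, 4
p = {"W_b": rng.normal(size=(k, d)), "b_b": rng.normal(size=k),
     "W_z": rng.normal(size=(k, k + d)), "b_z": np.zeros(k),
     "W_r": rng.normal(size=(k, k + d)), "b_r": np.zeros(k),
     "W_h": rng.normal(size=(k, k + d)), "b_h": np.zeros(k)}
h = tlgru_step(np.ones(d), np.zeros(k), np.ones(d), np.ones(d), p)
print(h.shape)
```

A full model would apply this step recurrently over all n observations, feeding the columns of δ_t and δ_l in turn.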

Generator architecture
The generator of the proposed method is shown in Fig. 3. The generator is an auto-encoder based on the TLGRU cell, including an encoder and a decoder. The generator can not only compress the incomplete time series X into a low-dimensional vector z by the encoder, but also reconstruct the complete time series X′ from z by the decoder. Different from a traditional auto-encoder, we just add some noise to corrupt the original samples rather than dropping out some values. The random noise η is sampled from a normal distribution N(0, 0.01), which avoids the loss of data information seen in a traditional auto-encoder and reduces over-fitting to a certain extent. The following are the update functions of the denoising auto-encoder. Since both the encoder and the decoder use the TLGRU cell to process multivariate time series, we need to input the corresponding time interval matrix δ_t and location interval matrix δ_l. Because the generator tries to produce a new sample X′ that is most similar to X, we add a squared error loss to the loss function of the generator. The following is the loss function of the generator, where a hyper-parameter controls the relative weight of the discriminative loss and the squared error loss.
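The generator loss itself was lost in extraction; an E²GAN-style reconstruction (the weight symbol λ is our notation, not necessarily the authors') combines the WGAN discriminative term with the masked squared error:

```latex
L_G = -D(X') + \lambda \left\lVert \left(X - X'\right) \odot M \right\rVert_2^2
```

The mask M restricts the squared error to observed entries, so the generator is only penalized where ground truth exists.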
First, we use zero value to replace the missing values of X at the input stage of TLGRU. Then we feed the TLGRU cell with the incomplete time series X and its interval matrix δ t and δ l . After recurrent processing of the input time series, the last hidden state of the recurrent neural network will flow to a fully connected layer. The output of this fully connected layer is the compressed low-dimensional vector z.
Next, we take z as the initial input of another fully connected layer. Then we use this output as the first input of another TLGRU cell. The current output of this TLGRU cell will be fed into the next iteration of the same TLGRU cell. At the final stage, we combine all the outputs of this TLGRU cell as the generated complete time series X ′ .

Discriminator architecture
The discriminator is composed of TLGRU cells and a fully connected layer. The task of the discriminator is to distinguish between the fake complete time series X′ and the true incomplete time series X. The output of the discriminator is a probability that indicates the degree of authenticity. We try to find a set of parameters that produces a high probability when we feed in the true incomplete time series X, and a low probability when we feed in the fake complete time series X′. The following is the loss function of the discriminator.
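The loss expression did not survive extraction; under the WGAN formulation adopted in this paper, a reconstruction is:

```latex
L_D = -\mathbb{E}\left[D(X)\right] + \mathbb{E}\left[D(X')\right]
```

Minimizing L_D pushes the discriminator's output up on true samples and down on generated ones.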
With the help of the TLGRU cell, multivariate time series can be handled successfully. The last hidden state of the TLGRU cell is fed into a fully connected layer that outputs the probability p of being true. We also use the sigmoid function to ensure that p ∈ (0, 1).

Imputation
For each true incomplete time series X, we map it into a low-dimensional vector z and reconstruct a fake complete time series X′ from z, so that the fake time series X′ is as close as possible to X. We then use the corresponding values of X′ to impute the missing values of X. The imputation formula can be summarized as follows.
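The formula amounts to keeping observed entries and copying generated ones. A minimal sketch, assuming missing entries are flagged by M = 0 (the `impute` helper and the toy values are our illustration):

```python
import numpy as np

def impute(X, X_prime, M):
    """Imputation rule: keep the observed entries of X (M == 1) and
    fill the missing entries (M == 0) with the generated values X'."""
    return M * X + (1 - M) * X_prime

X_obs = np.array([[1.0, 0.0, 3.0]])   # 0.0 is a placeholder for a missing value
X_gen = np.array([[1.1, 2.2, 2.9]])   # generator output X'
M     = np.array([[1.0, 0.0, 1.0]])
print(impute(X_obs, X_gen, M))        # observed values survive; 2.2 fills the gap
```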

Experiments
In this part, we present the dataset and the experimental results. To facilitate comparison with E²GAN, we also selected a meteorologic dataset as the experimental dataset. The experiments on the dataset show that our method achieves a new state-of-the-art imputation accuracy.

Dataset
The KDD dataset is a public meteorologic dataset from the KDD CUP Challenge 2018. It contains air quality and weather data collected hourly from 2017/1/30 to 2018/1/30 in Beijing. The records have a total of 12 variables, including PM2.5 (µg/m³), PM10 (µg/m³), CO (mg/m³), weather, temperature, and so on. One task of the KDD dataset is the imputation accuracy task. We selected 11 common air quality and weather observatories for our experiments. We first randomly dropped out 30% of the records for all observatories to obtain an experimental dataset with non-fixed collection timestamps. We then randomly dropped out p percent of the variables of the experimental dataset, where p ∈ {10, 20, ..., 80}. Finally, we imputed these time series and calculated the imputation accuracy as the mean squared error (MSE) between the original values and the imputed values.
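The evaluation protocol above can be sketched as follows, using synthetic data and mean imputation as a stand-in for the model (the helper names, shapes, and seed are our assumptions, not the paper's code):

```python
import numpy as np

def make_experiment(X, p, rng):
    """Randomly drop p percent of the entries of X; return the masked
    series and the boolean mask marking the dropped entries."""
    drop = rng.random(X.shape) < p / 100.0
    X_missing = np.where(drop, np.nan, X)
    return X_missing, drop

def mse(X_true, X_imputed, drop):
    """MSE between original and imputed values, on dropped entries only."""
    return float(np.mean((X_true[drop] - X_imputed[drop]) ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(12, 100))                # 12 variables, 100 hourly records
X_missing, drop = make_experiment(X, 30, rng) # drop 30% of the entries
X_imputed = np.where(drop, np.nanmean(X_missing), X_missing)  # mean imputation
print(round(mse(X, X_imputed, drop), 3))
```

Each baseline (Median, KNN, the proposed model, etc.) is scored by substituting its output for `X_imputed` and comparing the resulting MSE.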

Network details
We performed one task on a real public meteorologic dataset. For the KDD dataset, the input dimension is 132, the batch size is 16, the hidden unit number of all TLGRU cells is 64, and the dimension of the low-dimensional vector z is 128.

Baseline methods
We adopted different imputation methods to carry out experiments on the KDD dataset, the following is an introduction of the methods.
• Median: We simply use the median value to impute missing values.
• Mean: We simply use the mean value to impute missing values.
• MF: MF imputation factorizes the incomplete matrix into low-rank matrices and imputes the missing values with their product.

Table 1 shows the imputation results on the KDD dataset for the proposed method and the baseline methods: Median imputation, Mean imputation, KNN imputation, MF imputation, ISVD imputation, GAIN imputation, and E²GAN imputation. The first column of Table 1 is the missing rate. We can see that in all cases our method is one of the best methods, and it beats the other methods in most cases.

Experimental results on KDD dataset
To further compare the performance of the proposed method with the E²GAN model, we evaluated the effects of the discriminator and the random noise η on both models. Table 2 shows the resulting MSE on the KDD dataset. The first row is the missing rate. The second and third rows show the results for the proposed method and E²GAN. The fourth and fifth rows show the results for both models without the discriminator. The sixth and seventh rows show the results for both models without the random noise η.

Discussions
The experimental results on the KDD dataset with different missing rates are shown in Table 1. We can see that in the vast majority of cases, the proposed method achieved the smallest MSE among the baseline methods. Among the baselines for imputing multivariate time series, the GAN-based methods, such as GAIN and E²GAN, perform better than Median, Mean, KNN, MF, and ISVD. Furthermore, the GRUI-based methods, such as E²GAN, can take into account the non-fixed time interval and fade the influence of past observations as determined by the time interval matrix, effectively improving the imputation accuracy. In the proposed method, we optimized GRUI by introducing the location interval matrix; the weights of the time interval matrix and the location interval matrix are controlled by the hyper-parameters α_t and α_l. Our proposed method achieves a new state-of-the-art imputation accuracy for all percentages except 10% and 50%.

We can also see that the choice of values for the hyper-parameters α_t and α_l affects the imputation accuracy. We selected different hyper-parameters, conducted a large number of experiments, and finally obtained the results in Table 1. However, the hyper-parameter values used in Table 1 may not be optimal; we only use these values to improve the imputation accuracy, and the specific selection of hyper-parameter values needs further research.

The effects of the discriminator and the random noise η on the proposed method and E²GAN are shown in Table 2. As we can see, the discriminator and the random noise η both affect the imputation accuracy of the two models. In particular, the proposed method outperforms E²GAN in the vast majority of cases in all three experiments.

Conclusion
In order to learn the latent relationships between observations with non-fixed time intervals and location intervals in multivariate time series, we propose a novel TLGRU cell for dealing with missing values. We made the necessary modifications to the architecture of the end-to-end missing values imputation model E²GAN by replacing GRUI with TLGRU to make the generated fake sample closer to the original one. Experiments on a public meteorologic dataset show that our method outperforms the baselines in imputation accuracy and achieves a new state-of-the-art result.