A Machine-Learning Phase Classification Scheme for Anomaly Detection in Signals with Periodic Characteristics

In this paper we propose a novel machine-learning method for anomaly detection applicable to data with periodic characteristics where randomly varying period lengths are explicitly allowed. A multi-dimensional time series analysis is conducted by training a data-adapted classifier consisting of deep convolutional neural networks performing phase classification. The entire algorithm including data pre-processing, period detection, segmentation, and even dynamic adjustment of the neural networks is implemented for fully automatic execution. The proposed method is evaluated on three example datasets from the areas of cardiology, intrusion detection, and signal processing, presenting reasonable performance.


Introduction
Many real-world systems, both natural and anthropogenic, exhibit periodic behaviour. Monitoring such systems necessarily produces periodic time series. In one particular instance of such a monitoring application, one is interested in automatically detecting changes in the periodically repeating pattern and thus anomalies in the systems operation. This type of anomaly detection occurs in a wide range of different fields and applications, be they medical, e.g. diagnosing diseases of the cardiovascular and respiratory systems, in industrial contexts, e.g. monitoring the operation of a transformer or rotating machinery, and in signal processing and communications. The pursued aims range from simple monitoring to intrusion detection and prevention.
Traditionally, anomaly detection is performed in the form of outlier detection in mathematical statistics. Numerous methods have been proposed, including but not limited to distance-and density-based techniques [1,2] and subspace-or submanifold-based techniques [3,4,5]. Most of these approaches make no explicit use of the concept of time and are therefore usually less suited for the analysis of time series. Methods making explicit use of the temporal structure include classical models from statistical time series analysis such as autoregressive-moving average (ARMA) models [6], Kalman filters [7] or more general hidden Markov models [8], and rolling-window distance-based methods such as matrix profiles [9]. Distance analysing methods are effective for clean data but not robust against noise, whereas distributionbased methods from mathematical statistics are still powerful in the presence of noise, requiring dataspecific parametrisation. In the past few years, nonlinear methods, such as different types of recurrent neural networks (RNNs) and in particular long shortterm memory (LSTM) networks, have also come into use [10,11]. Many of these methods are difficult to train [12,13,14] or need large amounts of data in order to achieve reasonable performance while avoiding overfitting. On the other hand, in recent years, convolutional neural networks (CNNs) have gained popularity in image processing [15,16] where they are used mainly for classification tasks. The same principles that are responsible for the success of CNNs in image processing carry over to other types of signal processing when the number of dimensions of the convolutional kernels is changed accordingly. Most of the work using recurrent or convolutional networks for time series analysis focusses on forecasting or detecting certain patterns explicitly known at training time.

arXiv:1811.12119v2 [eess.SP] 27 Mar 2019
On these tasks, convolutional networks have recently been shown to outperform the previously state of the art LSTMs [17].
In this paper we consider data with periodic characteristics and design a machine-learning algorithm for time series analysis, in particular anomaly detection, applying convolutional neural nets in a manner which, to the best of the authors knowledge, has not been proposed previously. In contrast to existing methods and inspired by machine-learning methods for image processing, we employ a convolutional net acting not as a predictor or estimator but as a classifier whose classes indicate phase i.e. the relative location in time. We also integrate general procedures for data pre-processing and automated phase reclustering so that no manual action is required in between.
Our algorithm is tested on three datasets: a cardiology dataset (ECG database) [18], an industrial network dataset for cyber attack research (SCADA dataset) [19], and a synthetic waveform dataset described in detail in Section 4.3. It turns out that, to a certain extent, our method is robust against unclean data and the related neural networks do not show high sensitivity to the hyperparameters and are relatively easy to train.
The remainder of the paper is organised as follows. In Section 2 we specify the types of anomaly detection considered in this paper, comment on traditional methods, and introduce the concept of our solution. In Section 3 our general approach to the considered anomaly detection problems is described in detail, including data pre-processing, mathematical basis of convolutional neural networks, and training algorithm. In Section 4 our method is fine-tuned for the three aforementioned example datasets and in Section 5 the empirical results are evaluated. In Appendix A, dealing with the issue of randomly varying period length which shows up in many real world applications such as in the ECG data (Section 4.1) and synthetic waves (Section 4.3), an auxiliary period detection scheme is designed based on classical principles of signal processing. In Appendix B we perform some comparisons with other methods for anomaly detection in order to further highlight the advantage of using a convolutional neural network in the proposed manner.

Preliminaries
In preparation for the detailed description of our machine-learning phase classfication scheme given in Section 3, in this section we clarify the tasks of anomaly detection in time series with periodic characteristics, discuss some common methods, and outline the essential ideas of our approach.

Context of this work
In general, a time series {X t } t=0,1,2,... (i.e. a temporal sequence of observations X 0 , X 1 , X 2 , . . ., also termed signal) is said to exhibit periodic behaviour with period length s if similarities occur after every s time units, i.e., observations that are s time units apart, X t0 , X t0+s , X t0+2s , . . . for any t 0 , are similar.
Periodic signals occur naturally in a wide range of applications and in a large number of fields such as audio processing, vibration analysis, biomedical engineering, climatology, and economic time series analysis. Oftentimes, one wishes to monitor the behaviour of such a system. In particular, a common task when observing a signal is that of anomaly detection, i.e., the detection of deviations from a certain normal mode of operation. This has a variety of applications such as disease diagnosis, network security, fraud recognition in bank transactions, etc.
The general approach to anomaly detection is to relate a mathematical model (parametric or nonparametric) to the normal behaviour of the underlying system based on historic observations (training data) and set a confidence region for data of normal type; applying the data-adapted model to the ongoing observations (test data), whether the output lies within or outside the pre-defined confidence region decides if the corresponding input observation is considered normal or abnormal, respectively.
As with most naturally occurring siganals, many of the aforementioned signals do not satisfy the exact mathematical definition of periodicity. Instead, they exhibit a property which is referred to as quasiperiodicity which basically means that the signal does not exactly repeat itself, but has deviations both in its values and in the length of the actual periods. This behaviour is very common for instance in biological or climatological systems. As a consequence for the task of anomaly detection, a sophisticated mathematical model is required to capture the essence of the diverse and noise corrupted signals.

Tasks of anomaly detection
Mathematically, the approach to anomaly detection proposed in this paper applies to the following two types of problems: Type A The historic observations of normal type (training data) are made up of various signals {{X (ι) t }, ι ∈ I, share certain common normal-typecharacterising features, but differ in their values and exhibit periodic characteristics with individual period length s (ι) which may also fluctuate over time. The task of anomaly detection in such a setting consists in rating each ongoing observation signal {X t } t as normal or abnormal.
Type B The historic observations of normal type (training data) are made up of consecutive single data points X 0 , X 1 , . . . , X N −1 which jointly form a time series {X t } t=0,...,N −1 . The occurance of the data points X 0 , . . . , X N −1 follows certain normaltype-characterising patterns, which is reflected in the corresponding time series {X t } t=0,...,N −1 as seasonal effects associated with period length s where s may randomly vary over time. The task of anomaly detection in such a setting consists in specifying segments of onging observations {X t } t≥N which are abnormal. Problems of Type A arise from areas such as disease diagnosis, climatology, vibration analysis, etc., whereas problems of Type B are often addressed in the security sector and building monitoring systems within the framework of signal processing. In general, establishing an adequate mathematical model for the normal behaviour of a system requires a proper amount of training data. In our experiments in Section 4, our approach to the considered problems is applied to a cardiology dataset for detecting heart disease (cf. ECG database [18]) as an example of problems of Type A, a relatively small industry dataset in the context of network security (cf. SCADA dataset [19]) as an example of problems of Type B, and a more extensive synthetic waveform dataset injected with a variety of noise and anomalies (cf. Section 4.3) again as an example of problems of Type B. The experimental results are provided in Section 5.
From a mathematical perspective, problems of Type A are more challenging than those of Type B. In the setting of Type A a considerably complex mathematical model is needed for capturing diverse variations of the normal behaviour across a variety of training signals, whereas in the setting of Type B the required mathematical model for the normal behavior is to be fitted to a single training time series. Many traditional methods for anomaly detection in periodic signals may find direct applications to problems of Type B but fail to be applicable to problems of Type A. This will be further discussed in the subsequent section.

On common methods
Let us comment on the adequacy of some traditional methods for detecting anomalies in periodic signals in our context.

Distance-analysing methods
The most straightforward treatment of seasonal data goes back to cross correlation analysis, e.g. matrix profiles [9]. The basic idea therein is to apply a rolling window and define a Euclidean-type metric which measures the distance of consecutive values within the rolling window at different locations of the underlying time series from one another or from a fixed reference sequence (e.g. a mean window consisting of seasonal means); data points exhibiting large distance from the reference value are considered abnormal.
In general, distance-analysing approaches are not resistant against noise and fail to capture complex structures in the data. In the appendix, we evaluate a simple distance-based self-similarity approach in Section B.1. We also provide a distance-based version of our phase classification scheme (without artificial neural networks) for comparison in Section B.2.

ARIMA methodology and Kalman filtering
A more sophisticated class of methods arises from mathematical statistics, e.g. autoregressive integrated moving average (ARIMA) methodology, methods based on structural component time series models or more general Kalman filtering (based on the linear case of the general state-space model or hidden Markov model), cf. [20,21] for detailed description of the corresponding mathematical models. These approaches can be directly applied to problems of Type B described in Section 2.2 and are based on relating a stochastic model with parameters Θ = {θ 1 , . . . , θ r } to the training part {X t } t=0,1,...,N −1 of the observed time series {X t } t so as to make short-term (usually one-step ahead) forecasts, i.e., to estimate the conditional expectation E[X t+∆t | X t , X t−1 , . . . ; Θ] byX t+∆t for all t (in particular for t + ∆t ≥ N ), which basically relies on calculating the maximum likelihood estimatê Θ of the parameters Θ, making use of available observations. Setting a threshold value δ, if the actual observation X t0 varies enough from the forecast valuê X t0 in the sense that |X t0 −X t0 | > δ, then the data point X t0 observed at time t 0 is considered abnormal.
Among the aforementioned stochastic models, the most demonstrative one is to decompose the underlying time series into trend, seasonal, and independent noise components, where the trend and seasonal components are assumed to be deterministic functions of time which can be fitted by a polynomial and conducting Fourier analysis, respectively. In fact, this is a special case of the general structural component time series model with trend and seasonal conmponents being stochastic processes. Each structural component model can be straightforwardly represented as a linear state-space model for which Kalman filtering comes into use to generate forecasts; it also has an equivalent ARIMA model representation for which forecasting can be conducted by following the ARIMA methodology. The ARIMA approach is based on spectral theory. For seasonal time series, a parsimonious form termed (multiplicative) seasonal ARIMA (SARIMA) model may be considered. In general, modelling a time series with an ARIMA representation requires data-specific transformation (i.e. data pre-processing e.g. logarithmising, power transformation, and differencing) and a data-adaptive hyperparameter choice (i.e. the design of the parameter set Θ = {θ 1 , . . . , θ r }, in particular the number of parameters r) which relies on inspection of the autocorrelogram and partial autocorrelogram. Each ARIMA model has an equivalent linear statespace model representation allowing Kalman filtering to be employed for forecasting.
The ARIMA approach and Kalman filtering are powerful tools in many applications and in particular in the presence of noise, provided that the hyperparameter choice is reasonable. However, being the most technically manageble segment of the general state-space models, linear models lack complexity and therefore do not always deliver a feasible approximation for real world applications. In addition, the associated dataadapted model selection including data pre-processing requires specific expert knowledge and is therefore difficult to implement for fully automatic execution as in our machine-learning framework. Furthermore, considering problems of Type A described in Section 2.2, it is unclear how to choose a general representative time series {X * t } t in which the diverse variations arising from the individual training signals {{X (ι) t } t | ι ∈ I} are incorporated so that the model fitted to {X * t } t applies to all normal signals.

Long Short-Term Memory Units
Long short-term memory units (LSTM) are a special type of recurrent neural network (RNN). As such, the LSTM reads the input time series sequentially, transforming at each point in time the input data into a hidden state which is a nonlinear function of the current input and the hidden state one time step earlier. The advantage of LSTMs over most other types of RNN is that the dependency of the current on the previous hidden state is designed in such a way that the LSTM obtains the ability to keep (parts of) its hidden state over a larger number of time steps than is possible with other RNN architectures, i.e., LSTMs are able to "memorise" values from the past.
Applying LSTMs to prediction tasks for the purpose of anomaly detection works in a similar manner to the application of statistical methods described in the beginning of Section 2.3.2. The main differences are that LSTMs allow for nonlinear parametrisation and have the potential to support a much larger number of parameters which are not estimated directly but instead are randomly initialised at first and then optimised during training (learnt) to obtain the desired predictor. The complexity of the LSTMs allows them to ingest characteristics of rich and varied training data such as those from large training data sets of Type A as described in Section 2.2 through the process of training with a stochastic gradient descent (SGD) type algorithm. The training set is processed repeatedly and the parameters of the LSTM are adjusted to optimise the quality of the forecast across the entire training dataset.
Technically, the main drawback of LSTMs is the fact that they are fundamentally still RNNs and hence also suffer from some of the difficulties typical for training this class of artificial neural network such as exploding gradients and a high potential for overfitting.
As a general drawback of using one-step ahead prediction for anomaly detection in time series, if the time series is very complex and exhibits regions in which it is difficult to make precise forecasts, such as when analysing periodic signals containing steep edges or spikes whose positions or values vary randomly over time, reliably estimating the values in these regions can actually be impossible for any type of one-step ahead prediction. It is thus difficult to derive an anomaly detector from such a predictor as the estimated values can have a large distance to the actual ones and thus show up as false positives. In Appendix B.3 this is illustrated in more detail by training and evaluating an LSTM on the ECG database.

Concept of this paper
Let us now introduce the concept of our machinelearning phase classification approach to the problems specified in Section 2.2.

Motivation of using convolutional neural
networks for phase classification Convolutional neural networks (CNNs) are a specific architecture of feed-forward neural networks. When compared to a fully connected neural network, convolutional neural networks need fewer parameters. Hence they do not require as large a training dataset and are less prone to overfitting. CNNs make explicit use of the temporal or spatial structure of the input signal; the signal is analysed locally (local receptive fields) and in a shift invariant manner (translation invariance). Investiagtions on the internal representations present throughout the layers of CNNs show a high tolerance to noise of various kinds.
Like LSTMs, compared to statistical methods, CNNs have the advantage that, through the use of multiple channels and non-linearities, they provide enough flexibility to capture intricate structures of analysed signals and are able to find representations for large and varied data sets. They are however easier to train than RNNs, as they suffer less from the vanishing and exploding gradient problems. The capability of a CNN of being able to process high amounts of complexity has been analysed in the field of image processing, where it was shown [22] that the neurons inside a convolutional neural network can activate on patterns ranging from simple edges to things as complex as faces.
While CNNs can be used to make forecasts in time series, they particularly excel at classification of spatial or temporal data. Since the main problem of the LSTM-based approach to anomaly detection in time series outlined above is the general unfeasability of using one-step ahead forecasts, we capitalise on the strength of CNNs in classification tasks and devise a new type of anomaly detection scheme relying on phase classification instead of one-step ahead forecasting. More details on the properties and operation of CNNs are given in Section 3.2.

Phase classification and anomaly detection
Motivated by the advantages of convolutional neural networks in classification tasks when dealing with spatial or temporal data, the maching-learning approach proposed in this paper is based on the following key ideas: 1 Conducting multi-dimensional time series analysis by means of multi-channel deep convolutional neural networks, where each channel in the input layer corresponds to a single feature (dimension) of the considered time series 2 Identifying phases or, equivalently, relative locations (order of occurrence) of subpatterns from time series with periodic characteristics by means of training data-adapted classifiers so that subpatterns over different periods of the underlying time series are properly separated into a certain amount of classes To be more specific, considering a seasonal time series {X t } t with period begins (e.g. time of local peak values) {τ k } k , for a pre-determined initial number of classes n 0 , sampling from the original signal n 0 overlapping segments per period with a sliding window of length T , each subpattern {X m ∈ N, is assigned to the class labelled m mod n 0 .
For seasonal data, subpatterns sampled from the time series occur repeatedly and in fixed order within each single period. A successfully trained classifier outputs the correct class indicating the phase or, equivalently, the relative location in time (i.e. time distance between subpattern and period begin) of the input subpattern. Abnormal datapoints in an input pattern are expected to cause false classification results and therefore to be identified as anomalies, which yields a direct solution to problems of Type B described in Section 2.2. For problems of Type A described in Section 2.2, setting a minimum expected classification accuracy (threshold value) and evaluating the classification accuracy of each test signal over a certain number of periods (which is denoted by K in the sequel), those signals that fail to achieve this minimum are considered abnormal.
In order to optimise the classification accuracy of normal data and hence prevent false positive anomaly detection results, we carry out a dynamic reclustering which cancels confusing classes, i.e., subpatterns within a period of the signal that are similar enough to one another are merged into one class. This reclustering procedure along with the optimisation of the stride length ∆t := s/n 0 (i.e. time distance between the segments to be classified) is implemented as a dynamic model selection scheme integrated in our training algorithm. In addition, we design an auxiliary period detection scheme which is employed in case of a randomly varying period length s.
The block diagram in Figure 1 outlines the major steps of our training algorithm and anomaly detection scheme where the steps marked by dashed lines are conditioned by some model-adequacy monitoring criteria which are described in the subsequent section.

Method
In this section we present the general procedure of our phase classification scheme in detail and provide some guidelines for the hyperparameter choice.

Data pre-processing
Prior to being fed into the classifier neural nets, all input signals (including training, validation, and test data) are processed by a period detector, cut into overlapping segments by a sliding window, and subsequently normalised, where the segmentation and normalisation depend on the initial number of classes n 0 .

Period detection
In general, the seasonal effects of a time series can be recognised by examining the autocorrelogram (cf. [20, 2.1.4]) or periodogram (cf. [20, 2.2.1]). In many cases the period length s is fixed and known. In case of a fluctuating s (cf. e.g. data from cardiology), an auxiliary period detector is designed in Appendix A, capturing the time of local extremum values (considered as period begins in our setting) {τ k } k within individual periods and using cross-correlations in order to achieve robust period detection. Note that in our setting for randomly varying period length s, the stride length ∆t = s/n 0 while segmenting the signal varies proportionally to s so that the number of overlapping segments from each period is fixed and equal to n 0 .
period detector segmentation with n0 segments per period normalisation initial labelling initialse CNN w.r.t. n0 train CNN with SGD, repeatedly using input from X trained classifier neural network train new CNN trained classifier neural network w.r.t. n n0 (optimal n for n0) update n reclustering update n0 dynamic model selection optimal classifier neural network w.r.t. optimal n0 and n n0 training signals period begins Type A:

Sliding window
The classification accuracy of our approach turns out not to be highly sensitive to the length of the sliding window T . In the context of anomaly detection, the value of T should be kept relatively small (e.g. less than or equal to three times the average duration of a single abnormal data sequence) in order to highlight the local effect of the abnormal data points on the time series. We use a window size of T = 3s/n 0 (approximately three times the stride length) where s refers to the average value of s (recall that in general s may vary over time). Empirically, this has proven to be adequate for our purpose. Note that the length of the sliding window remains constant even in the case of randomly varying period length s, the varying stride length merely affects the amount of overlap between adjacent sliding windows.

Normalisation
In order to remove trend components and avoid skewed results due to dominating extreme values, the samples within the sliding window are normalised by adjusting the local mean and variance, that is, each with period begins {τ k } k to be processed by a classifier neural network corresponding to initial number of classes n 0 , for i = 0, . . . , d − 1 and m ∈ N, the vector (X i,(m) t ) t=0,...,T −1 is fed into channel i of the convolutional neural net, wherẽ For the training and validation data, each subpat-ternX (m) := {X (m) t } t=0,...,T −1 is initially labelled m mod n 0 . If reclustering occurs during the training so that the training and validation inputs are relabelled (cf. Section 3.3.3 for more details), then the test data are labelled in accordance with the final labelling of the training and validation data.

Convolutional neural networks
The core of our phase classifier is a convolutional neural network (CNN). CNNs are a special type of feedforward neural network, which exploit structures of space or time by sharing many of the weights among different neurons. We provide a short description of the mathematical basis of a convolutional neural network. For more detail on the subject, we refer the reader to the literature, e.g. [23,Ch. 9].
Basically, a feedforward neural network is a function f : R N (0) × R P −→ R n , mapping an input vector x ∈ R N (0) to an output vector y = f (x, p) ∈ R n , using a vector of parameters p ∈ R P to adapt the mapping. When acting as a classifier, n is the number of classes and the predicted class of a given input x is taken to be arg max j<n y j . The network can be decomposed into layers, each of which represents a different function mapping vectors to vectors, i.e., where L is the number of layers and for l = 0, . . . , L−1 are the transformations performed by each of the single layers and the vectors p (l) ∈ R P (l) are again parameter vectors used to adapt the mapping and given as subvectors of p. For ease of notation, let us denote the input to the function f (l) by x (l) , starting with x (0) = x and the output of the function f (l) by x (l+1) , ending with x (L) = y.
In the most simple feedforward neural networks, each of the transformations f (l) is given by a multiplication with a matrix a (l) ∈ R N (l) ×N (l+1) called the weight matrix followed by an addition of a vector b (l) ∈ R N (l+1) called the bias vector followed by the application of some non-linear function g : R −→ R called the activation function to each of the components of the resulting vector, i.e., .
The entries of the matrix a (l) and the vector b (l) are exactly the components of the parameter vector p (l) . This is called a fully connected layer.
In the case of a one-dimensional convolutional layer, the affine transformation x (l) −→ x (l) · a (l) + b (l) is replaced with a more restrictive kind of affine transformation, the so-called batched convolution. For this, the vector is fed into the i-th channel of the convolutional layer l. [1] Similarly, the parameter vector p (l) is distributed not into a matrix a (l) and a vector b (l) , but into a matrix of vectors (k (l) i,j,s ) i<M (l) ,j<M (l+1) ,s<S (l) called the convolution kernels and a vector b (l) ∈ R M (l+1) . The operation performed by the function f (l) is now given by and we have the constraint that T (l+1) = T (l) −S (l) +1. In many convolutional networks, the input vectors (x (l) i,t ) t<T (l) are extended (padded) by additional zero entries prior to being convolved. When padding with exactly S (l) − 1 zeros, the output vectors are of the same size as the input vectors. If furthermore the padding is performed symmetrically, i.e., if (S (l) −1)/2 zeros are added to both ends of the signal, this is referred to as 'SAME'-padding.
We also use a type of layer called a max pooling layer between two convolutional layers. The transfer func- where R (l) is a positive integer called the pool size and we have the constraints M (l+1) = M (l) and T (l+1) = T (l) /R (l) . Note that max pooling layers have no adjustable parameters p (l) .
In our networks, we employ both convolutional and regular fully connected layers. We apply 'SAME'padding in all convolutional layers and use the hyperbolic tangent (tanh) as activation function g throughout the entire network, which is a common choice in feedforward neural networks. [1] Vice versa, when placing a fully connected layer after a convolutional layer, the inverse reindexing is performed. When using a convolutional layer as the first layer of an artificial neural network and the input is in fact a segment of a multivariate time series {Xi,t} i<M,t with M = M (0) features, no reindexing is required. This is the case in all of our set-ups.
The exact layout of the convolutional network used for our task is displayed in Table 1. Here d, T , and n denote the dimension of the input time series, the sliding window size, and the current number of classes, respectively. The layer and kernel sizes are chosen to best adapt to varying input time series dimensions, sliding window sizes, and numbers of classes. In the convolutional layers, the number of channels is increased first by a factor of 6, then by a factor of 3. Such increases are common in convolutional neural networks and allow the layers to capture different aspects of the incoming signal such as edges and more complex patterns. The method, by which the parameter vector p is adjusted and the network adapts, is the minimisation of a function h : R n −→ R applied to the output of the neural network, called the loss function. Since we are classifying phases, to each training input x (and hence to each output y), there corresponds a label z ∈ {0, . . . , n−1}. In our case, we use the cross entropy loss function, which is given by where w z denotes a weight by which the losses of each class are scaled. The weights are statically determined and are in our case chosen to be proportional to the inverse of the number of training examples for each class in order to counteract bias caused by unbalanced classes.

Training algorithm
The neural networks in our algorithm are trained by the ADAM training algorithm which is a refined version of stochastic gradient descent (SGD). In SGD, the average loss for a set X batch containing pairs of training inputs x and corresponding labels z is minimised by changing the randomly initialised parameters p of the neural network according to the update rule where γ is a tuning hyperparameter called the learning rate and ∇ p is the gradient operator with respect to the vector of parameters p. This minimises the average of the loss values h(f (x, p), z). The gradient is computed in an efficient manner via reverse-mode auto differentiation which is basically an application of the chain rule. This is also known as the backpropagation algorithm and more details on the process can be found in the literature, e.g. [23,Ch. 4].
The set X batch is called a mini-batch and is taken to be a subset of the set of all available training inputs X . The update steps are performed with changing disjoint mini-batches until the entire training dataset X is exhausted. Each pass through the entire set of available training data is referred to as an epoch. To enhance the training process (cf. [24]), for rich datasets we change the size of the mini-batches during training, later epochs use larger mini-batch sizes. The adaptive adjustments performed by the ADAM algorithm detailed in [25] provide further enhancements to this process.
In contrast to usual classifiers, our algorithm encapsulates the gradient descent algorithm in a decision process monitoring the necessity of dynamic reclustering which aims to optimise the classification accuracy. The complete algorithm is given in Algorithm 1 (cf. also Figure 1), the single steps are described in more detail in the remainder of this section.
Each time having initialised the neural network for separating the currently considered classes, the gradient descent optimiser is run until a training-progressmonitoring stop criterion is fulfilled (cf. training stop criterion in Section 3.3.2 for more details). The classification ability of the underlying neural net is evaluated by means of the so-called confusion matrices (cf. Section 3.3.1) throughout the entire training. If at the end of training all classes are evaluated with sufficient accuracy (cf. reclustering stop criterion in Section 3.3.2), the trained neural net is stored; otherwise a relabelling procedure according to the overall confusion matrix is conducted where the class with least average evaluation accuracy is merged into the class to which the corresponding inputs are most commonly misclassified during training and the neural net is re-initialised with respect to the updated classes (cf. Section 3.3.3). Among all stored neural nets, the ultimate classifier is chosen as the one having the maximum number of output classes (cf. Section 3.3.4).
In the subsequent subsections, the aforementioned reclustering process and stop criteria are described in detail.

Confusion matrix
In order to track the progression of classification accuracy during training, we record the confusion matrix During the experimentation, we observe that classes which are easily distinguishable can already be separated after very few training iterations, whereas classes sharing more similarity perform significantly worse in the beginning and also show a slower increase of evaluation accuracy during training. Taking into account that the evaluated value of the loss function commonly follows a convex decreasing trend throughout the entire training, the above observation motivates us to assess the separation ability of the underlying neural net during training by weighting the confusion matrix with the respective contribution to the training progress and to introduce the overall confusion matrix denoted by V (n) := (V ij,(n) ) i,j=0,...,n−1 and defined as where n, E (n) , and H (n) k refer to the current number of classes, the number of training epochs that are performed until the training stop criterion (cf. Section 3.3.2) is satisfied, and the average training loss during the k-th epoch, respectively.
In our setting, the confusion matrices serve as the key objects of the decision criteria for our dynamic reclustering (cf. Sections 3. 3.2 and 3.3.3). The definition of the overall confusion matrix in terms of (3.1) by taking the weighted average throughout the entire training and dropping the values from the initial epoch (k = 0) aims to mitigate the random effect of the initialisation of the neural network. Empirically, this yields robust reclustering results during different test runs for fixed n 0 .

Stop criteria
The criteria for stopping the loops are related to parametrised effectiveness and accuracy reqirements in the following manner: where n, E (n) , and V of the confusion matrix is the share of correctly classified training inputs in all training inputs labelled as i evaluated during the last epoch while training the classifier neural network with n existing classes. Therefore, for a pre-defined margin of error α ∈ [0, 1), the above reclustering stop criterion requires that at the end of training the corresponding classifier neural network should correctly classify the training inputs of each existing class at least at the rate of 1 − α.

Reclustering
As long as the recustering stop criterion is not fulfilled, the subsequent reclustering procedure is considered necessary.
For a current number of classes n and existing classes labelled as 0, . . . , n − 1, let i • and j • denote the worst evaluated class and the correponding most misassigned class during the entire training of the respective neural net (i.e. until the training stop criterion is fulfilled) which are defined as respectively (recall the definition of V (n) in (3.1)). The class labelled as i • is merged into class j • . Furthermore, since we always assume the labels to be consecutive, the training and validation inputs with the largest label n−1 are assigned the label of the dropped class i • . Each time after relabelling, the weights corresponding to the remaining classes in the cost function are adjusted to be again inversely proportional to the current shares of the classes in order to warrant a well-balanced training of the updated classifier and the neural net is re-initialised.

Final number of classes
In the context of anomaly detection, we are dealing with the trade-off between optimising the classification accuracy of normal data preventing false positives (i.e. to cancel confusing classes) and maintaining the ability of misclassifying abnormal data for the sake of anomaly detection (i.e. to still retain sufficiently many classes characterising different phases within a period). Keeping this in mind, the final number of classes determining the ultimate classifier neural network is selected in the following manner: Given a maximum allowed number of classes n * 0 with n * 0 an even number n * 0 > 3, the starting initial number of classes is set to n 0 := n * 0 . Each time for an updated initial number of classes n 0 , the relabelling procedure described in Section 3.3.3 is run at most (n 0 − 3)-times (i.e. with at least 3 remaining classes). If the reclustering stop criterion is fulfilled after relabelling ∆n n0times, the candidate final number of classes related to n 0 is set to n n0 := n 0 − ∆n n0 and the corresponding neural net is stored. If max n 0 =n * 0 ,...,n0 n n 0 ≥ n 0 −2, the updating processes of n 0 is finished; otherwise n 0 is reduced by 2. The overall final number of classes refers to the maximum of n n0 taken along the entire path of n 0 , i.e. max n 0 =n * 0 ,...,4 n n 0 and the final classifier neural network is the one stored when this overall maximum was achieved. If this maximum was achieved more than once, we choose the neural network corresponding to the highest n 0 such that n n0 achieved this maximum. This is because a high value of n 0 corresponds to a narrow sliding window (cf. Section 3.1.2) and hence maximises the sensitivity of the anomaly detector.
If in the end no suitable network has been stored, we increase α and rerun the algorithm.
Finally, it is worth mentioning that once all the hyperparameters are determined, the whole training algorithm introduced above, including data preprocessing and dynamic reclustering, is implemented in a machine-learning manner so that the classification and anomaly detection process can be accomplished fully automatically.

Anomaly detection
Once training is finished and in particular when the ultimate classifier neural network determined by the model selection process turns out to use initial number of classes n 0 and final number of classes n n0 , each test signal is pre-processed following the procedure described in Section 3.1 with respect to n 0 , labelled with respect to n n0 in accordance with the training and validation data (recall the relabelling step along with the dynamic reclustering described in Section 3.3.3), and then processed by the trained ultimate classifier neural network.
For problems of Type A described in Section 2.2, a minimum expected per-signal average classification accuracy δ (threshold value) should be set depending on individual needs. For instance, δ could be determined on the basis of classification accuracy on validation data. For each test signal {X t } t recorded over K periods of time with period begins {τ k } k=0,...,K−1 , if the normalised segmentsX (m) , m = 0, . . . , Kn 0 − 1 (recall Section 3.1.3), processed by the ultimate classifier neural net are evaluated with an average classification accuracy rate less than δ, i.e., if #{m | X (m) correctly classified}/Kn 0 < δ, then the signal {X t } t is considered abnormal.
Considering problems of Type B described in Section 2.2, if a normalised segment {X

Experiments
In this section we apply our machine-learning algorithm proposed in Section 3 to three example datasets choosing from the domains of cardiology, industry, and signal processing, aiming to show the feasibility of the method in a range of applications. The cardiology dataset is the most complex and challenging dataset representing problems of Type A described in Section 2.2, as the recordings taken from healthy control patients exhibit a high level of diversity which needs to be captured by the classifier. This diversity mandates the use of a more complex representation which is one of the strengths of deep neural networks over other parametric models. The other two datasets demonstrate the applicability of the method in different contexts, including the detection of anomalies occurring only at certain instances in time and thus representing problems of Type B described in Section 2.2.

Cardiology dataset
The PTB Diagnostic ECG Database is a database created by the Physikalisch-Technische Bundesanstalt (PTB) consisting of 549 electrocardiogram (ECG) recordings gathered from 290 subjects aged 18 to 87. The ECGs were recorded using a non-commercial PTB prototype recorder, the specifications of which can be found on the database website [2] . The dataset is part of PhysioNet [26] and is further described in [18].

Input data
We use 3/5 and 1/5 of the measurements from healthy patients for training and validation, respectively. The trained classifier is tested on the remainder of the data from healthy patients and data from all ill patients.
Due to the large data volume, we manually resample the input data to a sample rate of 50 samples per second instead of the original 1000 before feeding it into the neural network (i.e. the actual time unit applied in our training amounts to 1 time unit = 20 ms). This operation is not strictly necessary, but it speeds up the training process. Also, we only use the first 60 periods of each recording during training and for testing. We train our classifier with resampled time series from healthy patients and use the data coming from all 12 conventional leads and 3 Frank leads (cf. [27]) for the ECG diagnostic, resulting in a convolutional neural net with 15 channels on the input layer.

Period detection
The first challenge when analysing ECG data consists in detecting the randomly varying periods of individual patients, for which we design a period detector. This detector is described in greater detail in Appendix A. The detector has a number of parameters which need to be adjusted to the dataset, the actual values used here are given in Table 2. For this dataset, the entire time series for feature 'i' is used as both the reference and input time series to the period detector. However, in order to ensure the requirement that no trend component exists in the signal, the first difference of the signal is used instead of the raw signal. In order to adjust for the offsets thus introduced at peak detection, between Steps 4 and 5 described in Appendix A, the reference window is adjusted to be precisely centred on the corresponding peak in the original (smoothed but not differentiated) signal, i.e., its midpoint T k0 is changed to arg max The maximum allowed adjustment of 10 has empirically been found to yield satisfactory results. The median of all observed period lengths approximately amounts to s = 700 ms = 35 time units.

Hyperparameters
During the training, the maximum allowed number of classes and per-class margin of error are set to n * 0 := 10 and α := 2 −5 , respectively.
As per description in Table 1, each of the classfier neural networks encountered during the run consists of two convolutional layers with M (0) = 15 and M (1) = 90 channels, respectively, with max pooling of size R (1) = 3 applied in between, followed by two fully connected layers, and the output layer. During the classifier selection process, the length T (0) of the input sequence, the kernel sizes S (0) , S (2) of the convolutional layers, and the size N The ADAM optimiser with learning rate γ = 0.1 is employed for training with SGD. We start at a minibatch size of 800 and increase it after every 2 or 3 epochs up to 4800.

SCADA dataset
In [19], Antoine Lemay and José M. Fernandez describe a simulation of an industrial control system, specifically designed for providing supervisory control and data acquisition (SCADA) network datasets for intrusion detection research. The generated datasets are openly available on GitHub [3] and contain periods of regular operation, manual interactions with the system, and anomalies caused by network intrusions. Since the operation of the simulated system is cyclic, the resulting data is mostly periodic.

Input data
Among the available datasets with common characteristics, we choose the first 4/5 and the last 1/5 of the dataset named 'characterization modbus 6RTU with operate' with a duration of 5.5 minutes in total for training and validation, respectively, where neither the injected malicious activities nor the manual operations included are labelled, both resulting in a certain proportion of noise in the corresponding time series. The trained classifier is tested on the only three correctly labelled datasets 'moving two files modbus 6RTU' ('Test Data 1'), 'CnC uploading exe modbus 6RTU with operate' ('Test Data 2'), and 'send a fake command modbus 6RTU with operate' ('Test Data 3'), including no manual operations, a small portion of manual operations, and a large amount of noise e.g. manual operations (causing non-intrusion anomalies), respectively. In each dataset, four features are considered: number and total size of sent packets, and number of active IP address and port pairs. At onesecond intervals, we record the increase in each feature and consider the corresponding 4-dimensional time series.
The given 10-seconds polling interval yields periodic characteristics of the considered time series with a fixed period length of s = 10 seconds.
According to Table 1, all convolutional neural networks considered during the entire run include M (0) = [3] https://github.com/antoine-lemay/Modbus_dataset 4 and M (1) = 24 channels on the first and second convolutional layers, respectively, and two fully connected layers placed between the last convolutional layer and the output layer. Considering the short input sequence length T (0) = 3s/n 0 = 30/n 0 with n 0 taking values in {10(= n * 0 ), 8, . . .}, we do not apply any max pooling, i.e. R (1) = 1. During the classifier selection process, the sizes S (0) , S (2) and N (3) of the connvolution kernels and the first fully connected layer, respectively, vary proportionally to the input length T (0) . The size of the output layer N (5) is equal to the current number of classes n which runs over the values in {n 0 , n 0 − 1, . . .} during the dynamic reclustering and the size of its preceding fully connected layer N (4) is the geometric mean of N (5) and N (3) .
The ADAM optimiser with learning rate γ = 0.01 and a mini-batch size of 4 are used for training with SGD.

Wave dataset
The waves dataset is a synthetic dataset loosely modelled on a system transmitting a periodic signal. From the theory of Fourier analysis, every differentiable periodic signal {x t } t with frequency f can be decomposed into its frequency components cf. [28, Theorem 2.1], which motivates the principal rule of our wave generator. In our consideration, the generated waves have no DC offset, i.e. a 0 := 0, and components only up to frequency 4f , i.e. a k := 0 for all k ≥ 5. The signals are supposed to be transmitted over a noisy channel which we assume to add filtered Brownian and white noise. The wave generator also has some inherent randomness in the form of clock jitter, amplitude noise, and phase noise. There are also a number of fault conditions which form the basis of the anomalies to be detected.

Wave generator
The waves in this dataset are of the form t = 0, 1, 2, . . . , with T t given by } t are independent (discrete) Ornstein-Uhlenbeck processes with individual sets of parameters. In general, an Ornstein-Uhlenbeck process {R t } t obeys the stochastic differential equation where µ ∈ R, σ > 0, θ ∈ [0, 1], and {W t } t is a standard Brownian motion, cf. e.g. [29,Ex. 6.6]. In discrete time, a process {R t } t=0,1,2,... following (4.1) can be approximated by generating i.i.d. random vari-ablesÑ 0 ,Ñ 1 ,Ñ 2 , . . . withÑ t ∼ N (µ, (σ/θ) 2 ) for all t = 0, 1, 2, . . . and exponentially smoothing them: Indeed, letting is a random walk with Gaussian increments and thus corresponds to a discretely sampled standard Brownian motion [30, (1.9)]. Therefore, (4.2) can be written as which yields a discrete counterpart of (4.1). The Ornstein-Uhlenbeck process can be thought of as a process performing a random walk where the increments are biased towards the mean µ. As such, it behaves locally like a Brownian motion, causing the power of the higher frequency parts of its spectrum to average 1/f 2 (brownian noise). The process can be used to model parameters of systems that tend to shift over time, while generally remaining close to a certain average value. For each single wave, a set of parameters controlling the governing processes is randomly generated using the parameters in Table 3. The means of the processes for amplitude and phase variation are sampled according to the following law: log 2 µ amp k ∼ U (−1, 1) and µ ph k ∼ U (0, 1) Table 3: Parameters for processes governing generated waves (k = 1, 2, 3, 4) U (a, b) denotes the uniform distribution on the interval [a, b). They remain constant throughout the wave and determine the overall shape of the wave.

Generated anomalies
Based on the parameters and processes employed by the wave generator, we inject the following four types of anomalies or noise: Amplitude anomalies The amplitude process {R amp k t } t of one of the frequency components (i.e., for a single k ∈ {1, 2, 3, 4}) is increased by a, where a is randomly sampled for each anomaly according to the law log 2 a ∼ U (1, 2). Phase anomalies The phase process {R ph k t } t of one of the frequency components is changed. The amount of change is randomly sampled for each anomaly from the distribution U (1/4, 3/4) resulting in a random phase change of at least 90 • and at most 270 • . Pulse anomalies A pulse of random amplitude is added onto the wave. For each anomaly, the amplitude p of the pulse is randomly sampled according to the law log 2 p ∼ U (2, 4) and the pulse width is a random integer drawn from the interval [2 5 , 2 6 ). White noise The white noise process {N t } t is amplified by a factor α which is randomly sampled for each anomaly according to the law log 2 α ∼ U (2, 6). For each wave, a segment of 2 16 samples is generated. Then 16 segments, each consisting of 2 12 samples are generated, the last 2 11 samples of which the anomaly or noise is injected into. For the evaluation, we use 24 generated waves, resulting in a number of 290 anomalies and 94 waves with increased white noise in the test dataset.

Input data and period detection
The generated waves are considered in 24 groups, where each group consists of a normal wave recorded over 2 8 = 256 periods and further recordings, each injected with a single type of anomaly with a normal start-up time of at least 2 11 = 2048 time units (i.e. the first entrance time of anomalies following the respective normal wave is to the right of the time stamp 2 11 = 2048). In each group, we take the first 7/8 and the remainder of the normal wave for training and validation, respectively, and subsequently test the trained classifier on the respective anomaly-injected test recordings.
Since the simulated waves contain interference in the time component which results in random period lengths s, we again make use of the period detector decribed in Appendix A using the parameters specified in Table 4. Note that in contrast to the treatment of ECG data, in each data group the reference window is selected among the subpatterns extracted from the training data. By construction, the average period length equals s = 2 8 = 256 time untis.

Hyperparameters
Throughout the entire training, we set the maximum number of classes and allowed per-class margin of error to n * 0 := 10 and α := 2 −6 , respectively. As presented in Table 1, for each of the 24 waves the corresponding classifier neural nets are all endowed with M (0) = 1 channel and M (1) = 6 channels on the first and second convolutional layers, respectively, where max pooling of size R (1) = 3 is applied between the covolutional layers, and two fully connected layers are set between the last convolutional layer and the output layer. During the classifier selection process, the length T (0) of the input sequence, the sizes S (0) , S (2) of the convolution kernels, and the size N (3) of the first fully connected layer vary proportionally to the sliding window length T = 3s/n 0 = 768/n 0 where n 0 runs over the values in {10(= n * 0 ), 8, 6, 4} if not stopped earlier. The size N (4) of the second fully connected layer is the geometric mean of the sizes N (3) and N (5) of its adjecent layers and the size N (5) of the output layer is equal to the current number of classes n which runs over the values in {n 0 , n 0 − 1, . . .} during the dynamic reclustering.
The ADAM optimiser with learning rate γ = 0.01 is employed for training with SGD. The mini-batch sizes are dynamically increased after every 2 or 3 epochs from 40 to 360.

Experimental results
In this section we present the empirical results of the treatment of the example datasets given in Section 4 following our general phase classification scheme described in Section 3. Here we provide both the results of selecting and training the optimal classifier neural networks and the results of anomaly detection obtained by evaluating the trained classifier neural networks on the test data (recall Section 3.4).

Cardiology dataset
The ultimate classifier resulting from the dynamic model selection process turns out to be a classifier neural network corresponding to initial number of classes n 0 = 6 and final number of classes n n0 = 4, cf. Table 5 for the layout of the final CNN. The label his-  tory recorded along with the dynamic reclustering is shown in Table 6. The average validation loss recorded during the training of the respective neural nets is presented in Figure 2. A training accuracy of 99% and a validation accuracy of 96% are achieved. Figures 3 and 4 illustrate the result of testing the trained classifier on three patients from the category 'healthy control' and three ill patients: the measurements on feature 'i' from the test patients are presented in a temporal resolution of 20 ms and the bars in the upper and lower halves of the figures refer to the predicted classes and the true labels of the segments from the considered test signals, respectively. [4] [4] Note that here and in the sequel the coloured bars in these diagrams are always plotted between the beginnings of adjacent segments to be classified, thus only covering approximately the first third of each segment.  Figure 5 presents a statistical evaluation of the per-patient test results on patients from the 7 most recorded categories in the considered database: 'dysrhythma', 'valvular heart disease', 'cardiomyopathy/ heart failure', 'bundle branch block', 'hypertrophy', 'myocardial infarction', and 'healthy control'. The lines in different colors represent the empirical distribution functions of the per-patient classification accuracy from the aforementioned categories. Observe that the blue line related to healthy patients is located in the bottom right corner of the diagram, to the left of which all other lines corresponding to ill patients are centred (cf. the median for each category), which enables us to distinguish ill patients from healthy patients in some cases. For instance, according to the figure, if we take the average validation accuracy of 96% as the threshold for the per-patient classification accuracy, all test patients from the categories 'dysrhythma' and 'valvular heart disease', 90% and nearly 85% of the patients from the categories 'cardiomyopathy/heart failure' and 'myocardial infarction', respectively, and over 70% of the patients from the categotries 'bundle branch block' and 'hypertrophy' will be considered as anomalies, whereas up to three false positive results (25% of) all tested patients from the category 'healthy control' will be assessed as normal. Since the sample sizes provided for the individual categories vary a lot (e.g. there are 148 subjects for 'myocardial infarction' whereas the entire category 'healthy control' consists of only 52 subjects including training, validation and test data applied in our context), we are not in the position to make a general statement on the choice of an ideal threshold value. Table 7 provides a statistical evaluation of the per-disease average classification accuracy. It turns out that the category 'healthy control' presents by far the best test result compared to all other categories related to heart disease (anomaly).
Note that our anomaly detection scheme does not incorporate any specific cardiological knowledge. It gives    an indication whether a patient may be ill or not, it detects deviations from the known healthy data and does not classify the diseases separately. It also only gives a statistical indication, which is a result somewhat similar to the one reported in [31] where it was observed that the ECGs of ill patients showed deviations in certain affine dependencies usually present between the 12-lead and 3-lead ECGs of healthy patients.

SCADA dataset
The final classifier determined by means of the dynamic model selection scheme uses n 0 = 10 and n n0 = 4, cf. Table 8 for the layout of the final CNN. The  respective label history recorded during the dynamic reclustering and the evolution of the average validation loss are presented in Table 9 and Figure 6, respectively.
In Figure 7, the number of active port pairs extracted from 'Test Data 1' is plotted against time (in seconds) and the bars in the upper and lower halves represent the classes predicted by our trained neural net and the true labels of the test segments, respectively; segments which result in prediction errors are considered anomalies.  The final results of our anomaly detection algorithm on the entire test data are summarised in Table 10. In the first two (cleaner) test datasets with no or only Indeed, the SCADA datasets which are applicable in our setting are quite small. Due to the noncompatability between datasets with small and large amounts of noise (i.e. non-intrusion anomalies appearing in the form of pulses), it is difficult to choose one suitable dataset for training and to test the intrusion detector on datasets with incompatible characteristics, e.g., it would be unfeasible to train an anomaly detector on one of the cleaner datasets and then test it against a noisy dataset, or vice versa. For a more extensive treatment of anomaly detection of Type B described in Section 2.2 using a richer dataset and the corresponding results, cf. Section 4.3 and Section 5.3.

Wave dataset
Overall, an average classification accuracy of 99% is achieved on both training and validation data. Figures 8,9,10 and Figure 11 present the detection results of our classifiers trained by individual example waves and tested on segments injected with different types of anomalies and white noise, respectively.
Again, in each diagram the bars in the upper and lower halves refer to the predicted classes and true labels of the data from the test segments fed into the trained classifier, respectively. Notice that in Figure 11, slightly increased white noise does not lead to any classification errors, which suggests some robustness property of our classifier against noise.
The final results of our anomaly detection algorithm tested on the 24 groups of synthetic waves are shown in Table 11. The amount of anomalies and white noise are obtained by counting the number of test waves injected with the respective type of interference, whereas the denominator for evaluating false positives equals  the number of available prediction windows (test segments) in the clean test data. Overall, our algorithm yields high detection rates of all types of injected anomalies (99% on average); the small rate of false positives (< 1%) confirms the model adequacy of our phase classification scheme; the low error rate in the presence of increased white noise shows the robustness of our classifier neural networks against noise to a certain extent.

Conclusion
In this paper, we proposed a novel approach to detecting anomalies in time series exhibiting periodic characteristics, where we applied deep convolutional neural networks for phase classification and automated phase similarity tagging. We evaluated our approach on three example datasets corresponding to the domains of cardiology, industry, and signal processing, confirming that our method is feasible in a number of contexts. Step 3: The reference signal is now fed into a simple peak detector which proceeds to inductively find peaks T k spaced approximately one base period apart via where σ ∈ [0, 1) is a tolerance value to account for the variability of period lengths in the signals.
Step 4: The detector now extracts subpatterns {U (k) t } ŝλ t= −ŝλ from the reference signal Y t centred at the peaks T k , i.e., U (k) t = Y T k +t . Here, λ ∈ (0, 1/2] is another tolerance parameter to mitigate the effects of period length variability. Then the seasonal means are computed. Here M denotes the total number of subpatterns. Let now t= −ŝλ is the subpattern with maximum similarity to the mean {Ū t } ŝλ t= −ŝλ and is thus suited as a reference pattern.
Step 5: The reference pattern is now used for detecting the periods in the input signal by computing the cross-correlation function: Step 6: Finally the simple peak detector from Step 3 is applied to the cross-correlation {C τ } τ to obtain the final segment beginnings.
A comparison of the periods detected by the simple peak detector from Step 3 and the cross-correlating period detector from Step 6 can be seen in Figure 13. The top graph shows the input to the simple peak detector, the bottom graph shows the cross-correlation; the gray boxes in the top half of the backgrounds represent the segments inferred by the simple peak detector, those in the bottom half represent those found by the crosscorrelating period detector. Notice how glitches in the input signal easily manage to confuse the simple peak detector while the cross-correlating period detector is robust to such perturbations.

Appendix B: Comparison with other methods
In this section we perform some comparative evaluation of other methods in order to highlight in particular the utility of phase classification via convolutional neural networks for anomaly detection. We consider two classes of methods: distance-based approaches employing various types of Euclidean distance comparison (cf. Sections B.1 and B.2) and one-step ahead forecasting (cf. Section B.3). In the first class of comparison we demonstrate that even in a phase classification framework, simply comparing the Euclideantype norm of segments of the underlying signals is less suited for capturing the essence of complex and noise corrupted data. In the second class of comparison we show that even with the highly complex parametrisation of LSTMs, anomaly detection based on one-step ahead prediction is prone to false positive results. Since the ECG dataset exhibits the highest level of diversity and is thus most difficult to treat among all example datasets introduced in Section 4, for this demonstration we only evaluate the reference methods on this dataset.
B.1 Self-similarity approach One way of detecting anomalies in periodic signals is to take a sliding window of roughly one period length, normalise it, and look for a similar segment in the data preceding the window by e.g. one to two periods. A threshold is then used to determine whether the considered window is similar enough to one of the preceding segments. This principle is used in the so-called matrix profiles approach, cf. e.g. [9]. No training data is used in this method and thus no particular characteristics of the normal data themselves are employed during the anomaly detection. The only point where training data can be useful in this approach is to determine the similarity threshold mentioned above, choosing it so as to avoid having too many false positives on non-anomalous data.

B.1.1 Method
Formally, if {X t } t is the input signal and T is the window length, normalise each segment X (τ ) := {X τ +t } t=0,...,T −1 for τ ≥ 0 in an analogous manner to that of Section 3.1.3 and denote byX (τ ) , τ ≥ 0, the respective normalised segments. Now choose a minimum shift d min and a maximum shift d max and compute for each τ ≥ d max a τ := min τ =τ −dmax,...,τ −dmin where X (τ ) −X (τ ) denotes the Euclidean distance ofX (τ ) toX (τ ) , i.e., Now depending on the type of problem, there are two ways to decide whether a signal is anomalous: If the task is one of Type A described in Section 2.2, the average self-dissimilarity of the test signal is computed and compared against some threshold which can for instance be determined by the average self-dissimilarities of the training signals. If on the other hand the task is one of Type B described Section 2.2, a threshold is chosen close to the maximum self-dissimilarity of the known normal part of the signal and the selfdissimilarity for the remaining part of the signal is compared against this threshold.

B.1.2 Results
For the sake of comparison, we evaluate the performance of the self-similarity-based approach on the ECG database in a similar manner as in our main result in Section 5.1 and first transform the selfdissimilarity computed as described above into a selfsimilarity rating via the transformation x → 1/(x+1). We then average the self-similarities for each recording and plot the distributions of these averages grouped by disease. This plot is shown in Figure 14a. One can clearly see that, apart from patients of the category 'dysrhythmia' which have the lowest self-similarity, this approach does not manage to produce any separation of ill from healthy patients.

B.2 Distance-based phase classification
A distance analysing method similar to that of Section B.1 but more closely related to our main approach is to compute reference windows for the different phases of non-anomalous signals and use these to classify the corresponding segments of the other signals by assigning the class whose reference window has maximum similarity. Basically, this is the same method as our phase classification scheme but with the classifier neural network replaced by a simple nearest reference classifier.

B.2.1 Method
For this method the same data pre-processing with respect to a chosen number of classes n 0 as described in Section 3.1 is applied to training, validation, and test signals.
For the training dataset X consisting of normalised segmentsX θ labelled as belonging to class θ for θ = 0, . . . , n 0 − 1 (recall the set X in Figure 1 for both types A and B), we compute for each class of phase θ the seasonal averages where again · denotes the Euclidean norm as described in Section B.1.
The remaining part of anomaly detection is performed as in Section 3.4.

B.2.2 Results
To evaluate the performance of the distance-based phase classier on the ECG database, just as in our main result in Section 5.1 we record the classification accuracy on the different types of heart disease and analyse the distribution of the per-patient classification accuracy grouped by the corresponding disease. The separation into training, validation, and test data is the same as in our main experiment on the ECG database (cf. Section 4.1.1). For the number of classes, a setting of n 0 = 6 shows the best results, which conforms to our model selection result (cf. optimal n 0 presented in Section 5.1). The average validation accuracy amounts to 79%. The per-disease average classification accuracy is evaluated in Table 12. A plot of the distri- bution of per-patient classification accuracy evaluated on test patients from different categories is shown in Figure 14b. As can be seen from both the table and the plot when compared to the results of our approach (cf. Figure 5 and Table 7), the convolutional classifier neural network delivers generally better classification performance with far better results being obtained on the healthy control patients. In particular, we see that the blue line representing the healthy test patients in Figure 5 is located much closer to the bottom right corner than in Figure 14b, indicating better modelling of the normal data by our convolutional classifier neural network. An anomaly detection using the distancebased classifier thus would have a higher false positive rate than one using the convolutional neural network for classification when achieving comparable detection performance. For instance, according to Figure 14b, if we use the average validation accuracy of 79% as the threshold value as discussed in our main result in Section 5.1, a similar detection rate in most of the ill categories but a higher false positive rate of 38% on healthy test patients will be achieved compared to the result of our approach (25% on healthy test patients, cf. Section 5.1).
B.3 Long short-term memory predictor approach As described in Section 2.3.3, one can use a long shortterm memory unit (LSTM) to predict the signal one time step ahead, then use a threshold on the difference of this prediction to the actual signal to decide whether the signal behaves as expected or should be considered anomalous. We choose to demonstrate this method in preference to the statistical forecasting approaches mentioned in Section 2.3.2, as no further adjustment to the method is needed for handling problems of Type A described in Section 2.2 and, more importantly, the forecasting performance of LSTMs on data with complex patterns has been shown to be better than that of linear models in general.

B.3.1 Method
Since LSTMs are a somewhat complex type of recurrent neural network, we will not describe their construction here and instead refer the reader to the literature on the subject, e.g. [32]. In our treatment of the ECG database, we use an LSTM with an input layer size of 15, a hidden layer size of 60, and an output layer size of again 15. We use a mean-squared-error loss function, discarding the first 200 predictions (four seconds) to allow the LSTM to first align with the given signal. We use the same ADAM algorithm for the training of the LSTM that we also employed for training our convolutional classifier neural networks with a learning rate of γ = 2 −10 and 2 10 training epochs. The separation into training, validation, and test data is also the same as in our main experiment on the ECG database (cf. Section 4.1.1).

B.3.2 Results
To evaluate the performance of the LSTM on the ECG database, we analyse the distribution of the forecasting precision on the patients coming from the different groups. A plot of this distribution is shown in Figure 14c. The measure of performance used here is given by 1/(M SE + 1) where M SE denotes the mean squared error of the predictions on the ECG recording. This transformation is applied again for the sake of easier comparability with our main result in Section 5.1. Using the same measure of performance, the prediction precision evaluated on the training and validation data are 91% and 63% on average, respectively. The large gap between the training and validation performance suggests the presence of the overfitting phenomenon mentioned in Section 2.3.3, whereas our classifier CNN approach does not suffer from this problem (see the consistency of training and validation accuracy results presented in Section 5.1). As presented in Figure 14c, if we choose a threshold value of 63% based on the validation performance as discussed in our main result in Section 5.1, this will lead to a false positive anomaly detection rate of 31% on healthy test patients, which is higher than that of our approach (25%); at the same time, the detection performance of the LSTM-based detector is lower, with e.g. only about 75% of the patients labelled 'myocardial infarction' (the largest category) being detected as anomalous, compared to the almost 85% of our approach. Furthermore, for illustration purposes, two examples of the predictions coming from the LSTM are displayed in Figure 15. Notice that for both patients, the pre-

List of Abbreviations
ARIMA autoregressive integrated moving average ARMA autoregressive-moving average CNN convolutional neural network DC direct current ECG electrocardiogram i.i.d. independent and identically distributed IP Internet Protocol LSTM long short-term memory MSE mean squared error PTB Physikalisch-Technische Bundesanstalt RNN recurrent neural network SARIMA seasonal autoregressive integrated moving average SCADA supervisory control and data acquisition SGD stochastic gradient descent