Robust energy disaggregation using appliance-specific temporal contextual information

An extension of the baseline non-intrusive load monitoring approach for energy disaggregation using temporal contextual information is presented in this paper. In detail, the proposed approach uses a two-stage disaggregation methodology with appliance-specific temporal contextual information in order to capture time-varying power consumption patterns in low-frequency datasets. The proposed methodology was evaluated using datasets of different sampling frequency, number and type of appliances. When employing appliance-specific temporal contextual information, an improvement of 1.5% up to 7.3% was observed. With the two-stage disaggregation architecture and using appliance-specific temporal contextual information, the overall energy disaggregation accuracy was further improved across all evaluated datasets with the maximum observed improvement, in terms of absolute increase of accuracy, being equal to 6.8%, thus resulting in a maximum total energy disaggregation accuracy improvement equal to 10.0%.


INTRODUCTION
In recent decades, rising energy consumption in residential and industrial environments has become a crucial issue, with consumer households now accounting for approximately 40% of the total energy consumed worldwide [1,2]. With the development of information and communication technologies (ICT), the increasing usage of electrical appliances and the automation of tasks, electric power needs will grow further and the number of electrical appliances per household will increase significantly within the next 20 years [1,2]. Despite the expected increase in total energy consumption, studies estimate that 20% of the energy consumed by households could be saved by changing consumers' behaviour and improving poor operational strategies [3,4]. Furthermore, the establishment of smart grids and demand management, as well as the fluctuation of power generation due to an increasing share of renewable energies, are aggravating the issue of increasing energy needs [5,6]. These changes in energy demand and generation are challenging for network operators and power generation facilities, since power needs are becoming less stable and less predictable while rising at the same time [7,8]. To address these challenges, accurate and fine-grained monitoring of electrical energy consumption within residential environments is needed [2,9], as well as proper demand management [10]. However, energy monitoring today is mostly done via aggregated measures of energy consumption in the form of monthly bills and therefore does not address the above-mentioned issues.
To measure energy consumption, smart meters are used. A smart meter, also referred to as a smart plug, is a device used to measure electrical power/energy consumption with a resolution in the order of seconds to minutes. Smart meters measure the voltage drop over the device/circuit and the current flowing through it with an arbitrary sampling frequency, which usually varies from 1/60 Hz to 30 kHz [11]. Higher sampling frequencies are usually preferred, since they capture more detailed information about the energy consumption; however, they increase the amount of acquired data linearly and the cost of hardware exponentially [12]. With a sampling period in the order of seconds, data handling over several months or years becomes feasible and hardware costs are relatively low. However, with the ability to provide real-time information through smart metering and to determine detailed household energy consumption, consumer privacy concerns arise and energy data protection becomes prominent [7,13]. To address these issues, energy monitoring must be carried out cost-effectively and with consideration of privacy concerns.
According to [14], the largest improvements in terms of energy savings can be made when monitoring energy consumption at device level to detect faulty device operation and inefficient or suboptimal operational strategies. To measure energy consumption at device level, either the energy has to be measured for each device separately using one sensor per device, or the aggregated energy (the combined energy of several devices measured at one central point, e.g. the power inlet of a household) has to be disaggregated to device level using computational algorithms. When only one sensor is used to disaggregate the total consumed energy and extract energy consumption at appliance level, the task is referred to as Non-Intrusive Load Monitoring (NILM), as introduced in [15]. NILM formulates the energy disaggregation problem as a single-channel source separation problem, where the smart meter is the only input channel measuring the total power consumption and the goal is to find the inverse of the aggregation function to calculate the consumption per device. Compared to Intrusive Load Monitoring (ILM), NILM has the advantage of requiring less hardware (ILM uses one smart meter per device) and is more acceptable to consumers with respect to privacy preservation [7,13].
In general, NILM assumes that there is a single observation (smart meter measurements) and multiple unknowns (electrical devices), making the disaggregation problem highly under-determined and difficult to solve without further constraints. Therefore several approaches for disaggregation have been proposed, which can be broadly split into methods with and without Source Separation (SS). Approaches without SS are based on the decomposition of the aggregated signal into a sequence of feature vectors, which are classified to device labels by a Machine Learning (ML) algorithm (e.g. Artificial Neural Networks (ANN) [16], Decision Trees (DT) [17], Hidden Markov Models (HMM) [18], K-Nearest Neighbours (KNN) [19], Support Vector Machines (SVM) [20]) or by a predefined set of rules and thresholds [21,22]. Furthermore, recent research in deep learning and big data has led to a significant increase in the use of data-driven approaches using large-scale datasets (e.g. AMPd [23]). Approaches based on Convolutional Neural Networks (CNNs) [24,25,26], Recurrent Neural Networks (RNNs) [27,28] and Long Short-Term Memory networks (LSTMs) [27,29] have been proposed in the literature, while denoising Auto Encoders (dAEs) [30] and Gated Recurrent Units (GRUs) [26] have also been used. Approaches with SS are based on single-channel source separation algorithms (e.g. non-negative matrix factorization [31], sparse component analysis [32]) to extract the consumption of each device from the aggregated signal by using additional constraints (e.g. sparseness or sum-to-one [33]) during the optimization procedure. The features extracted from the aggregated signal in approaches with and without SS strongly depend on the sampling frequency, with either macroscopic (for low sampling frequency) or microscopic (for high sampling frequency) features being extracted. Macroscopic features are mainly active and reactive power, while statistical values of the active or reactive power (e.g. mean, median, variance or energy) can be estimated as well [34]. Microscopic features can be current harmonics or transient energy [21,35] and require a high sampling frequency to be calculated (1 kHz and above).
Several NILM approaches with and without SS have been proposed in the literature. In these approaches, one-state or multi-state electrical devices have been modelled by finite-state machines, i.e. with steady energy consumption behaviour per operational state [15]. In contrast to one/multi-state devices, there is no established approach for detecting appliances with continuous power consumption or with non-linear behaviour and a highly varying power signature [36,37]. Researchers have addressed this issue by using high-frequency features or wavelets to detect transient device behaviour, which however have the drawback of higher hardware cost and increased computational power requirements [12,37,38]. Therefore most approaches use disaggregation algorithms with sampling periods in the order of seconds to minutes, in combination with temporal information (e.g. Factorial Hidden Markov Models (FHMM) [18,39]), to identify appliances with varying power consumption [12,40]. Furthermore, special filtering techniques (e.g. Kalman filters [41]) with time-varying coefficients and probabilistic approaches using appliance grouping [42] have been proposed to address the issue of modelling devices with continuous or non-linear characteristics.
In this paper we propose the integration of temporal contextual information for each electrical appliance in the form of concatenation of adjacent feature vectors within a device-dependent time window to improve device detection performance in NILM. The remainder of this paper is organized as follows: In Section 2 the proposed NILM approach using temporal contextual information per device is presented. In Section 3 the experimental setup is described and in Section 4 the evaluation results are presented. Finally the paper is concluded in Section 5.

METHODS
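NILM energy disaggregation can be formulated as the task of determining the power consumption at device level based on the measurements of one sensor within a time window (frame or epoch). Specifically, for a set of $M - 1$ known devices, each consuming power $p_m$ with $1 \le m \le M$, the aggregated power $p_{agg}$ measured by the sensor will be

$$p_{agg} = f(p_1, \dots, p_{M-1}, g) = \sum_{m=1}^{M-1} p_m + g = \sum_{m=1}^{M} p_m \quad (1)$$

where $p_M = g$ is a 'ghost' power consumption usually consumed by one or more unknown devices. In NILM the goal is to find estimations $\hat{P} = \{\hat{p}_m, \hat{g}\}$ of the power consumption of each device using an estimation method $f^{-1}$ with minimal estimation error and $\hat{p}_M = \hat{g}$, i.e.

$$\hat{P} = f^{-1}(p_{agg}) \quad \text{s.t.} \quad \min \Big( p_{agg} - \sum_{m=1}^{M} \hat{p}_m \Big)^2 \quad (2)$$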

Baseline NILM architecture
As the baseline NILM approach we consider a data-driven energy disaggregation methodology without the use of SS techniques, adopted in several publications found in the literature [39,43,44,45,46]. The baseline NILM consists of pre-processing of the aggregated signal, decomposition of the sequence of frames into a sequence of feature vectors, and processing by a classification/regression algorithm using pre-trained appliance models to determine device operation, as shown in Figure 1. During the pre-processing step, filtering and/or down-sampling is performed and the signal is then frame blocked. Framing can be done with either constant or variable frame length [35,47]. In the state-based baseline NILM approach a regression algorithm is used instead of a classification algorithm in order to estimate device consumption at state level [48,49], while classification is used in event-based approaches to detect the On/Off states of devices [39,45,46].

Proposed NILM architecture
The proposed methodology uses a two-stage disaggregation scheme, with the first stage performing power consumption estimation for each device by extending the baseline NILM architecture to use Temporal Contextual Information (TCI) and the second stage fusing the estimation results of each device using a regression model. The block diagram of the proposed two-stage NILM architecture using TCI is illustrated in Figure 2.
Figure 2: Block diagram of the NILM architecture using device dependent temporal contextual information (TCI).
Similarly to the baseline NILM, the aggregated power consumption signal is initially pre-processed and a feature vector $v_t \in \mathbb{R}^N$ is extracted for every $t$-th frame, with $1 \le t \le T$, where $T$ is the total number of frames. During stage 1 the feature vectors $v_t$ are expanded to $C_{m,t}$ using their adjacent ones, thus creating a temporal contextual window of length equal to $W = 2k + 1$ concatenated frames, i.e.

$$C_{m,t} = E_m(v_t) = [v_{t-k}, \dots, v_t, \dots, v_{t+k}] \quad (3)$$

where $E_m$ is the temporal contextual information expansion function for the $m$-th device and $C_{m,t}$ is the expansion for the $m$-th device and the $t$-th frame. The TCI expansion is performed separately for each device $m$ using its optimal temporal contextual information $W_{opt} = \{W_m\}$, with $W_m$ being calculated offline on a bootstrap training dataset. The expanded feature vector $C_m$ of each device $m$ is then processed by a regression model $R_m()$ and the output of stage 1, $\hat{p}'_m$, is an initial estimation of the power consumption of each device:

$$\hat{p}'_m = R_m(C_m) \quad (4)$$

The power consumption estimations, $\hat{P}' \in \mathbb{R}^M$, of the devices from stage 1 are used together with the feature vector, $v_t$, in order to calculate enhanced estimations of the power consumption of the $M$ devices. In detail, in the second stage the regression models receive as input the power consumption estimates $\hat{P}'$ from stage 1 and the initial feature vector $v_t$. The use of the device estimates $\hat{P}'$ allows the second-stage regression models to model power consumption correlations between different devices. In both stages 1 and 2 the regression models operate in parallel and separately for each device. The proposed methodology combines the integration of temporal contextual information with the device-specific operation of each appliance, thus capturing temporal information individually for each appliance and letting the regression model learn it.
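The expansion and fusion inputs described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `expand_with_context` and `stage2_input` are hypothetical, and clamping the window at the signal boundaries is an assumption, since the text does not state how the first and last frames are handled.

```python
from typing import List, Sequence

Vector = List[float]


def expand_with_context(features: Sequence[Vector], k: int) -> List[Vector]:
    """Concatenate every frame's feature vector with its k preceding and
    k succeeding vectors, giving a context window of 2*k + 1 frames."""
    if k < 0:
        raise ValueError("k must be non-negative")
    n_frames = len(features)
    expanded: List[Vector] = []
    for t in range(n_frames):
        window: Vector = []
        for offset in range(-k, k + 1):
            # Assumption: indices outside the signal are clamped to the
            # first/last frame (edge handling is not given in the text).
            idx = min(max(t + offset, 0), n_frames - 1)
            window.extend(features[idx])
        expanded.append(window)
    return expanded


def stage2_input(feature: Vector, stage1_estimates: Vector) -> Vector:
    """Build the stage-2 regressor input: the initial feature vector of a
    frame followed by the stage-1 power estimates of all devices."""
    return list(feature) + list(stage1_estimates)
```

With per-device window lengths, `expand_with_context` would simply be called once per device with that device's own `k`, and each result passed to the regression model of that device.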

EXPERIMENTAL SETUP
The proposed two stage NILM architecture with the device dependent temporal contextual information presented in Section 2 was evaluated using a number of publicly available datasets and a deep learning algorithm for regression. The datasets and parameters set for deep learning regression are presented below.

Databases
Three different publicly available databases were used, namely the ECO [50], the REDD [51] and the iAWE [52] database. The ECO and REDD databases consist of different datasets, each containing power consumption recordings from a different house, while the iAWE database consists of recordings from one house. The evaluated datasets are tabulated in Table 1, with the number of appliances denoted in column '#App'. In the same column, the number in brackets is the number of appliances after excluding devices with power consumption below 25 W (indicated in red), which were added to the power of the 'ghost device', similarly to the experimental setup followed in [53,54]. The next three columns in Table 1 tabulate the sampling period, the duration and the appliance types of each evaluated dataset.
[Table 1, appliance list as recovered (incomplete): (1) lighting, (2) furnace, (3) kitchen-outlets, (4) outlets-unknown, (5) washer-dryer, (6) stove, (7) air-conditioning, (8) air-conditioning, (9) miscellaneous, (10) smoke-alarms, (11) lighting, (12) kitchen-outlets, (13) dishwasher, (14) bathroom, (15)]
The appliance type categorization is based on the operation of the appliances as described in [55,56]: one-state devices have only an on/off status (e.g. resistive lamps, kettles or fridges without significant power spikes) and multi-state devices have several discrete power consumption states (e.g. washing machines with different washing cycles), while the remaining two types are non-linear loads (e.g. electronics) and devices with a continuous power consumption signature, which are controlled by power electronics (e.g. air conditioners) and usually have an exponential decay pattern. In all appliance types a peak might appear at the beginning of the signature, e.g. in refrigerators. Characteristic examples of the power consumption signatures of each of the four appliance types are illustrated in Figure 3.
The ECO-3 and REDD-5 datasets were excluded: ECO-3 contains only the aggregated signal and not the power consumption per device, so there is no ground truth to evaluate NILM approaches [50], and REDD-5 has a very short monitoring duration [57]. Regarding the size of the evaluated data, the whole REDD database was used (ignoring the gaps in the measurements as in [58]), while one week of data was chosen for the ECO and iAWE databases to have a similar amount of training samples as in the REDD database. In detail, the week from the 5th of July till the 11th of July 2012 was selected from the ECO database, while the week from the 8th of June till the 14th of June was selected from the iAWE database. These particular weeks were selected so that as many devices as possible appear in the aggregated signal, and because previous papers using the ECO and iAWE databases [44,50] did not report the time interval used. In Table 2 the appliances from each dataset are categorized according to the four different appliance types mentioned above. The categorization is done with respect to the electrical properties of the appliances and their corresponding power consumption signatures. In addition, the percentage of the total energy per appliance type in each dataset is given. The id numbers of the appliances (columns 'App') correspond to the appliances of each dataset as denoted in Table 2. As can be seen in Tables 1 and 2, the number of appliances as well as the appliance types vary across the evaluated datasets. In particular, the number of appliances varies from six (ECO-1) to 18 (REDD-3), while the number of appliance types varies from two (REDD-2) to four (REDD-4/6); thus the 11 evaluated datasets include different device combinations and characteristics, which are representative of modern households.
Common to all datasets is their relatively low sampling rate (sampling period of 1-3 sec) and the consideration of active power samples only, resulting in computational simplicity and runtime advantages [59]. Furthermore, all three databases were recorded within the last decade, meaning that the households used were equipped with recent device technology [50,51].
In our experimental setup the real aggregated signal (which includes ghost power from unknown devices) was used to evaluate the performance of the proposed NILM methodology, thus making the experimental setup identical to real-life conditions. Specifically, the input aggregated power consumption signal was the one originally measured by the smart meter (one sensor only) during data acquisition (similarly to [60]) and not an artificially generated aggregated signal created by adding the power consumption signals from a manually selected closed set of devices (synthesized data), as in [29,61,62,63], which was criticized in [64] for not corresponding to real-world conditions.
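To make the role of the ghost power concrete, the following Python sketch derives it from a measured aggregate signal and the sub-metered device traces. The helper name `split_ghost`, the data layout and the choice to compare each device's peak power against the 25 W threshold are assumptions made for illustration; the text only states that devices with power consumption below 25 W were added to the ghost device.

```python
from typing import Dict, List, Tuple


def split_ghost(
    aggregate: List[float],
    devices: Dict[str, List[float]],
    min_power_w: float = 25.0,
) -> Tuple[Dict[str, List[float]], List[float]]:
    """Return the devices kept for disaggregation and the ghost power trace.

    The ghost power is the part of the measured aggregate that is not
    explained by the kept devices, so it contains both unknown loads and
    the low-power devices that were excluded.
    """
    # Assumption: a device counts as "below 25 W" if its peak never reaches it.
    kept = {
        name: trace
        for name, trace in devices.items()
        if max(trace) >= min_power_w
    }
    ghost = [
        agg - sum(trace[t] for trace in kept.values())
        for t, agg in enumerate(aggregate)
    ]
    return kept, ghost
```

By construction, the kept device traces and the ghost trace sum back to the measured aggregate at every time step, which is what distinguishes this setup from a synthesized closed-set aggregate.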

Pre-processing and Feature Extraction
During pre-processing the aggregated signal was frame blocked into frames of 10 samples with an overlap between successive frames equal to 50% (i.e. 5 samples). For every frame a feature vector consisting of the mean, root mean square, standard deviation and peak to root mean square value was calculated, similarly to [65], resulting in feature vectors of dimensionality equal to four. In detail, the mean value is used as the most general information about the energy consumption in each frame, while the root mean square value is used as a filtered version of the mean value, smoothing outliers and small changes (noise) in the power consumption signal [65]. Moreover, the standard deviation is used in order to capture sudden changes of the power signal within a frame, i.e. changes of device states, while the peak to root mean square value is selected to capture the maximum change in power normalized to the root mean square value of the frame, in order to have a quantitative measure of the change in power within each frame [65]. In order to consider temporal contextual information, expanded feature vectors were extracted by concatenating to each feature vector the preceding and the succeeding vectors, as described in Section 2.
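The framing and the four features described above can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments: the function names are hypothetical, the population form of the standard deviation is an assumption, and the peak to root mean square value is assumed to be the largest absolute sample divided by the frame RMS.

```python
import math
from typing import List, Sequence


def frame_signal(
    samples: Sequence[float], frame_len: int = 10, hop: int = 5
) -> List[List[float]]:
    """Split a power signal into frames of frame_len samples; a hop of 5
    with frames of 10 samples gives the 50% overlap described in the text."""
    return [
        list(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


def frame_features(frame: Sequence[float]) -> List[float]:
    """Return [mean, RMS, standard deviation, peak-to-RMS] for one frame."""
    n = len(frame)
    mean = sum(frame) / n
    rms = math.sqrt(sum(x * x for x in frame) / n)
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((x - mean) ** 2 for x in frame) / n)
    # Assumption: an all-zero frame gets a peak-to-RMS value of 0.
    peak_to_rms = max(abs(x) for x in frame) / rms if rms > 0 else 0.0
    return [mean, rms, std, peak_to_rms]
```

Applying `frame_features` to every frame returned by `frame_signal` yields the sequence of four-dimensional feature vectors that is subsequently expanded with temporal context.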
For the regression models of stage 1, feed-forward Deep Neural Networks (DNNs) were used. In detail, the DNN consisted of three hidden layers with 32 sigmoid nodes per layer. The number of layers and the number of nodes were empirically selected after evaluation on a bootstrap training subset with artificially generated aggregated data (with the ghost power removed), as shown in Table 3. A "one vs. all" regression approach was followed, thus the output layer consisted of a single regression node predicting the power of the $m$-th appliance. In order to avoid overlap between training and test data, each of the evaluated datasets was split equally into two subsets, one for training the DNN models and one for evaluating the proposed architecture.
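The shape of such a per-appliance regressor can be illustrated with a small forward-pass sketch in plain Python. Only the layer layout (three hidden layers of 32 sigmoid nodes and a single output node) comes from the text; the function names, the random weight initialisation and the linear output activation are assumptions, and training is deliberately left out.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def build_network(
    n_inputs: int, hidden: Sequence[int] = (32, 32, 32), seed: int = 0
) -> List[Layer]:
    """Create an untrained network: n_inputs -> 32 -> 32 -> 32 -> 1."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    layers: List[Layer] = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # Assumption: small uniform random weights and zero biases.
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict(network: List[Layer], x: Sequence[float]) -> float:
    """Forward pass: sigmoid hidden layers, one linear regression output."""
    activation = list(x)
    for i, (weights, biases) in enumerate(network):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        is_output = i == len(network) - 1
        activation = z if is_output else [sigmoid(v) for v in z]
    return activation[0]
```

For a context window of five frames and four features per frame, the network would be built with `build_network(n_inputs=20)`, and one such network would be kept per appliance in line with the "one vs. all" setup.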

RESULTS AND DISCUSSION
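The architecture presented in Section 2 was evaluated according to the experimental setup described in Section 3. The performance was evaluated in terms of estimation accuracy ($E_{ACC}$), as proposed in [66], taking into account the estimated power $\hat{p}_{m,t}$ (with $p_{m,t}$ denoting the corresponding ground-truth power), where $T$ is the number of disaggregated frames and $M$ is the number of disaggregated devices including the ghost power, i.e.

$$E_{ACC} = 1 - \frac{\sum_{t=1}^{T} \sum_{m=1}^{M} |\hat{p}_{m,t} - p_{m,t}|}{2 \sum_{t=1}^{T} \sum_{m=1}^{M} |p_{m,t}|} \quad (5)$$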
For evaluating estimation accuracy at device level, Eq. 5 was modified and the summation over appliances was eliminated, resulting in Eq. 6:

$$E_{ACC}^{m} = 1 - \frac{\sum_{t=1}^{T} |\hat{p}_{m,t} - p_{m,t}|}{2 \sum_{t=1}^{T} |p_{m,t}|} \quad (6)$$

The NILM architecture with temporal contextual information (TCI) was tested for a set of temporal contextual windows of different length. The experimental results of the TCI architecture (i.e. the output of stage 1 in Figure 2) for different temporal contextual window lengths $W = 2k + 1$, with the same $k$ for all devices and $1 \le k \le 6$, are shown in Table 4. The best performing length of the temporal contextual window for each of the evaluated datasets is indicated in bold. In the first column ($W = 1$) the performance without TCI is given. In the last column ($W_{opt}$) the estimation accuracy when using the optimal temporal contextual window separately for each device is shown. As can be seen in Table 4, the use of TCI improves energy disaggregation performance when compared to the baseline NILM system ($W = 1$) across all evaluated datasets. In the case of using a temporal contextual window of the same length for all devices, i.e. $W = 3$ up to $W = 13$, the best performing setup varies from $W = 5$ to $W = 11$. In general, the datasets whose optimal window length is short ($W \le 5$) mostly have one/multi-state types of devices, while datasets with longer optimal TCI lengths ($W \ge 9$) are dominated by devices of non-linear/continuous type. The NILM performance using TCI is further improved when the optimal temporal contextual window length per device is used ($W_{opt}$). Specifically, the use of an optimized value for each device instead of a flat value for all devices improves the performance by between 0.5% (REDD-4) and 2.2% (ECO-2/REDD-1), in terms of absolute improvement. The use of device dependent TCI was found to improve the performance across all evaluated datasets, and especially in the datasets with an approximately equal energy consumption distribution across the appliance types, like datasets ECO-2 and REDD-1.
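The accuracy values reported in these results can be computed along the following lines. The sketch assumes the commonly used form of the estimation accuracy metric, in which the summed absolute error is normalised by twice the total ground-truth power; the function names and the frame-by-device list layout are hypothetical.

```python
from typing import List, Sequence


def estimation_accuracy(
    estimated: Sequence[Sequence[float]],
    actual: Sequence[Sequence[float]],
) -> float:
    """Overall estimation accuracy (Eq. 5 as assumed here).

    Both arguments are indexed as [frame][device]; the ghost power is
    treated as one more device column.
    """
    abs_error = sum(
        abs(e - a)
        for est_t, act_t in zip(estimated, actual)
        for e, a in zip(est_t, act_t)
    )
    total = sum(abs(a) for act_t in actual for a in act_t)
    if total == 0:
        raise ValueError("ground-truth power is zero everywhere")
    return 1.0 - abs_error / (2.0 * total)


def device_accuracy(
    estimated: Sequence[Sequence[float]],
    actual: Sequence[Sequence[float]],
    device: int,
) -> float:
    """Per-device accuracy (Eq. 6): the same ratio without the sum over devices."""
    est_col: List[List[float]] = [[frame[device]] for frame in estimated]
    act_col: List[List[float]] = [[frame[device]] for frame in actual]
    return estimation_accuracy(est_col, act_col)
```

A perfect estimate gives an accuracy of 1, and the factor of two in the denominator reflects that, when the total estimated power matches the total measured power, each misassigned watt is counted once as an overestimate for one device and once as an underestimate for another.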
Next we evaluated the performance of the two-stage methodology presented in Section 2. The evaluation results of the proposed NILM architecture are shown in Table 5. For the purpose of direct comparison of the two-stage architecture with the TCI approach (stage 1), the same training and test subset division was used in all evaluated datasets. The best achieved performance of the TCI approach for each of the evaluated datasets shown in Table 4 is repeated in Table 5 as well. As can be seen in Table 5, the proposed two-stage methodology outperforms the TCI NILM architecture (stage 1) in all evaluated datasets. In detail, the highest performance improvement (when considering a temporal contextual window of the same length for all devices) in terms of $E_{ACC}$ values was observed in the REDD-3 dataset (+5.2% for $W = 5$), followed by the REDD-2/ECO-5 datasets (+3.0% for $W = 5$), while the lowest improvement was found in the REDD-6 dataset (+0.1% for $W = 3$), when compared to the TCI NILM. Moreover, the best energy disaggregation performance for ten out of eleven datasets was observed for temporal contextual window lengths of $3 \le W \le 11$, with the majority of the datasets having an optimal temporal contextual window length of $5 \le W \le 9$. In the case of the ECO database (with only 6-9 appliances per dataset) the two-stage NILM methodology offered an improvement of 0.5%-3.0% in terms of $E_{ACC}$, while for the REDD database (with 10-18 appliances per dataset) it offered an improvement of 0.1%-5.2%. When considering the optimal temporal contextual window length per device (column '$W_{opt}$' in Table 5), the energy disaggregation improvement offered by the two-stage NILM architecture is even higher. In particular, the highest performance improvement was observed in the ECO-2 and ECO-4 datasets (+5.2% and +3.0%, respectively), while the lowest improvement was observed in the ECO-5 dataset (+0.1%), when compared to the TCI NILM.
When compared to the baseline NILM the highest performance improvement is +10.0% (iAWE) and the lowest one is +2.0% (ECO-6).
To further compare the results with the NILM methods proposed in the literature, the very recent work of [67] was used, which includes a summary of NILM performances for the REDD database for different setups. Approaches using the most popular experimental setup, i.e. houses 1, 2, 3, 4 and 6 with all devices, and measuring performance using the $E_{ACC}$ metric were considered. Moreover, the results from [67] were extended by including recently published results [70,72] on the same experimental setup. It is worth mentioning that although the same data and the same accuracy metric were used, direct comparison is not assured, as data splits or pre-processing might vary between the compared approaches (such information is not provided in most papers found in the bibliography). The results are tabulated in Table 6. As can be seen in Table 6, the proposed fusion methodology outperforms all other reported approaches on the REDD-1/2/3/4/6 dataset setup. In detail, the proposed approach outperforms the Powerlets approach [70] by 4.3%, while it performs 1.7% better than the supervised GSP proposed in [72]. However, it must be noted that the approach in [72] uses a reduced number of appliances and thus cannot be directly compared with the other NILM approaches.
An analysis of the proposed two-stage NILM methodology at device level was also performed. In Table 7 the energy disaggregation improvement in terms of absolute increase of device estimation accuracy ($E_{ACC}^{m}$) and the corresponding optimal temporal contextual window length per device are presented. The first column in Table 7 denotes the type of each appliance as defined in Tables 1 and 2. As can be seen in Table 7, appliances belonging to type A (i.e. single or multi-state appliances with a power consumption signature that does not vary in time, like air exhaust, disposal, electric heat, iron, lamp) do not benefit significantly from the two-stage NILM methodology with temporal contextual information, since the energy disaggregation improvement for type A devices ranges between 0.0%-3.4% with an average improvement of 1.6%. Type B appliances (i.e. devices without strong temporal behaviour but with a significant power peak at the beginning of their power signature, like dishwasher, freezer, fridge, washer-dryer) were found to benefit from the proposed methodology, with the energy disaggregation improvement for type B appliances ranging between 0.4%-17.8% with an average improvement of 8.6%. In the case of non-linear appliances (appliance type C, e.g. electronic devices, entertainment, laptops), the power signature usually varies strongly with time and the temporal contextual information can capture their dynamic characteristics well, with the energy disaggregation improvement for type C appliances ranging between 0.2%-12.7% with an average improvement of 3.8%. As regards continuous devices (appliance type D, like air-conditioner and watermotor), their power signature appears in the form of an exponential rise or decay including significant power peaks at the onset of their signature.
Due to their slowly but strongly time-varying behaviour, their amplitude variation can be captured by temporal contextual information and misclassification with multi-state appliances of similar consumption amplitude levels can be reduced, with the energy disaggregation improvement for type D devices ranging between 1.4%-44.7% with an average improvement of 28.6%. The effect of the two-stage temporal contextual information NILM methodology proposed in Section 2 on each of the four appliance types is summarized in Table 8. As can be seen in Table 8, the energy disaggregation performance for type D devices improves by almost 30%, followed by type B, which benefits by almost 10%. Also, the average optimal temporal contextual window length for appliance types D and B is $W = 9.00$ and $W = 7.38$, respectively. For the case of non-linear appliances (type C) the performance improvement is almost 4%; however, the average optimal window length is greater than that of type B, which is most probably due to the longer duration of patterns as well as the non-repetitive micropatterns within non-linear appliances. Furthermore, the two-stage architecture improves the detection of continuous or non-linear appliances, as they can be highly related to the daily routine of the users/consumers or even be dependent on each other, as for example in the case of TV and Entertainment appliances, which are usually interconnected. For such devices, with inter-device dependencies or daily routine patterns, a priori knowledge of the power consumption of the other devices they operate together with, or of devices with a similar daily routine (i.e. usually operating or not operating simultaneously), can be beneficial for the estimation of their power consumption. Such devices can benefit from the fusion stage of the proposed architecture, in which estimates of the power consumption of the other appliances (calculated in the first stage) are used as input. In addition, detection of devices with power spikes, i.e. peaks that appear during the switching on of electrical motors, e.g. in fridges or freezers, was found to benefit from the fusion stage of the proposed methodology, since the presence of a power spike within a frame affects the distribution of energy among the set of devices to be disaggregated, which is implicitly expressed by the power consumption estimates of each device detector computed at the first stage of the proposed architecture. The power signature for each appliance type was illustrated in Figure 3.

CONCLUSION
A two-stage methodology for energy disaggregation using temporal contextual information was presented. The methodology extends the baseline non-intrusive load monitoring (NILM) approach by employing a two-stage disaggregation and using temporal expansion of the feature vectors within a time window of variable length. The proposed methodology was evaluated using the real aggregated signal as measured by the smart meter across various datasets with different sampling frequencies, numbers and types of appliances, demonstrating an improvement of performance across all datasets. The maximum improvement in terms of absolute increase of accuracy was equal to 10.0% when using appliance-driven temporal contextual information lengths and two-stage disaggregation. In detail, the most significant improvements were observed for devices with power peaks and exponential decay power consumption signatures, such as refrigerators and air conditioners. Moreover, improvements in energy disaggregation performance were observed for appliances with strongly time-varying power signatures like electronic devices, e.g. stereos, laptops or entertainment electronics. With the use of the fusion stage, inter-device dependencies or daily routine patterns can be modelled and power spikes can be found, thus resulting in further improvement of the disaggregation accuracy.