Once the hybrid dataset composed of real electricity data and specific electricity stealing data has been obtained, various detection models can be adopted [9,10,11,12,13,14,15,16,17,18,19]. However, two features of \({\bar{\mathbf{S }}}\), i.e., the small scale of electricity stealing and its random coherence with time, keep these models from performing optimally, so deep features are needed. When seeking deep features, note that the values of electricity consumption data are small and their trend usually changes gently. In this case, if neural networks with many layers are used, the vanishing gradient [22] degrades the performance of detection models. To address this issue, the ResNet model [23] has attracted wide attention. It adds shortcut connections that preserve shallow features, which alleviates the vanishing gradient problem. However, the performance of ResNet is still not optimal, because the information obtained each time the convolution kernels extract features from electricity data is limited. Therefore, in this paper, we introduce feature-map attention [24] and propose Bi-ResNet.

### 3.1 Unit block of Bi-ResNet

The unit block of Bi-ResNet consists of two kinds of basic components, i.e., convolution modules and shortcut connections, and two kinds of logical operations, i.e., copy and vector addition. Its structure is shown in Fig. 3. For ease of analysis, we further divide the unit block into three steps.

In step 1, the obtained \({\bar{\mathbf{S }}}\) is first mapped from three dimensions to two dimensions to fit the network input. For user *k* on the *d*-th day, the input data \({\bar{\mathbf{S }}}_{k,d}\) are constructed from \({{\varvec{{s}}}_{k,d}}\) as follows

$$\begin{aligned} {\bar{\mathbf{S }}}_{k,d} = \left[ {\begin{array}{*{20}{c}} {{s_{d,1}}}&{}\quad {{s_{d,2}}}&{}\quad \cdots &{}\quad {{s_{d,167}}}&{}\quad {{s_{d,168}}}\\ {{s_{d,2}}}&{}\quad {{s_{d,3}}}&{}\quad \cdots &{}\quad {{s_{d,168}}}&{}\quad {{s_{d,1}}}\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots &{}\quad \vdots \\ {{s_{d,167}}}&{}\quad {{s_{d,168}}}&{}\quad \cdots &{}\quad {{s_{d,165}}}&{}\quad {{s_{d,166}}}\\ {{s_{d,168}}}&{}\quad {{s_{d,1}}}&{}\quad \cdots &{}\quad {{s_{d,166}}}&{}\quad {{s_{d,167}}} \end{array}} \right] \end{aligned}$$

(3)
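The mapping in Eq. (3) stacks cyclic left shifts of the 168 hourly readings, so that row *r* starts at the *r*-th reading and wraps around. A minimal sketch of this construction follows; the function name `cyclic_shift_matrix` and the placeholder readings are illustrative assumptions, not taken from the paper.

```python
def cyclic_shift_matrix(s):
    """Build the square matrix of Eq. (3): row r is s cyclically shifted left by r."""
    n = len(s)
    return [[s[(r + c) % n] for c in range(n)] for r in range(n)]


# Placeholder weekly profile: the values 1..168 stand in for s_{d,1}..s_{d,168}.
week = [float(h) for h in range(1, 169)]
S = cyclic_shift_matrix(week)

print(len(S), len(S[0]))    # matrix shape
print(S[1][0], S[1][-1])    # second row starts at s_{d,2} and ends at s_{d,1}
```

With these placeholder values, each entry equals its own hour index, which makes the wrap-around pattern of Eq. (3) easy to check by eye.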

When \({\bar{\mathbf{S }}}_{k,d}\) is fed into the network, the copy function \({\mathrm{cp}}(\cdot )\) is applied, which duplicates the input data into two identical copies

$$\begin{aligned} \{ {\mathbf{X }}_1,{\mathbf{X }}_2\} = {\mathrm{cp}}({\bar{\mathbf{S }}_{k,d}}). \end{aligned}$$

(4)

Then, two convolution modules with different convolution kernel sizes are applied. Specifically, each convolution module \(i, \forall i \in \{1,2\}\), consists of three parts, i.e., the convolution function \({\mathrm{conv}}( \cdot )\), batch normalization \({\mathrm{BN}}( \cdot )\) and the rectified linear unit function \({\mathrm{relu}}( \cdot )\). We use \({H}_{i,{\mathrm{size}}}(\cdot )\) to denote the *i*-th convolution module with a size \(\times\) size convolution kernel, so that the *i*-th pseudo-output \({\mathbf{Y}}_{i}\) is calculated by

$$\begin{aligned} {\mathbf{Y }}_{i} = {H_{i,{\mathrm{size}}}}({\mathbf{X }}_i). \end{aligned}$$

(5)

Notice that when two convolution kernels of different sizes are applied to the same input data, the features can be extracted more thoroughly, which is essential for deep features.
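To illustrate a convolution module \({H}_{i,{\mathrm{size}}}(\cdot )\), the sketch below chains the three parts named above in the order conv, BN, relu. Several points are assumptions made for illustration only: the composition order, the unpadded convolution, the averaging kernels (real kernels are learned), and the `normalize` function, which standardizes a single feature map as a stand-in for batch normalization without learned scale and shift.

```python
import math


def conv2d_valid(x, kernel):
    """Unpadded 2-D cross-correlation of a square kernel over x (list of lists)."""
    k = len(kernel)
    rows, cols = len(x) - k + 1, len(x[0]) - k + 1
    return [[sum(x[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
             for c in range(cols)] for r in range(rows)]


def normalize(x, eps=1e-5):
    """Stand-in for BN: zero mean and unit variance over one feature map."""
    vals = [v for row in x for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in x]


def relu(x):
    """Rectified linear unit applied element-wise."""
    return [[max(0.0, v) for v in row] for row in x]


def conv_module(x, kernel):
    """Assumed composition of one module: relu(BN(conv(x)))."""
    return relu(normalize(conv2d_valid(x, kernel)))


# Toy 7x7 input and hypothetical averaging kernels of size 3 and 5.
x = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
k3 = [[1.0 / 9] * 3 for _ in range(3)]
k5 = [[1.0 / 25] * 5 for _ in range(5)]
y1 = conv_module(x, k3)
y2 = conv_module(x, k5)
print(len(y1), len(y1[0]))
print(len(y2), len(y2[0]))
```

Under these assumptions the two modules return feature maps of different sizes, so fusing them element-wise would need padding or cropping, a detail the text does not specify.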

In step 2, we fuse the same feature under different scopes, i.e., the outputs of the convolution modules with the \(3 \times 3\) and \(5\times 5\) convolution kernels, and then use one convolution module with a \(1\times 1\) convolution kernel to extract the deep feature from the fused feature. Specifically, we first duplicate the pseudo-output \({{\mathbf{Y }}_{i}}, i \in \{1,2\}\) as

$$\begin{aligned} \{ {\mathbf{X }}_{i\times 2+1},{\mathbf{X }}_{(i+1)\times 2}\} = {\mathrm{cp}}({{\mathbf{Y }}_{i}}). \end{aligned}$$

(6)

After duplication, each \({\mathbf{X }}_{i\times 2+1}\) is carried by a shortcut connection that maintains the shallow feature. The \({\mathbf{X }}_{(i+1)\times 2}\) are fused by the \({\mathrm{va}}(\cdot )\) function as

$$\begin{aligned} {\mathrm{va}}({\mathbf{X}} _{{4}},{\mathbf{X}} _6) = {1 /2} \times {\mathbf{X}} _{{4}} + {1 /2} \times {\mathbf{X}} _{{6}}, \end{aligned}$$

(7)

and the deep feature is extracted by the third convolution module with a \(1 \times 1\) convolution kernel. Accordingly, the third pseudo-output \({{\mathbf{Y }}_{3}}\) is calculated by

$$\begin{aligned} {{\mathbf{Y }}_{3}} = {H_{3,1}}({\mathrm{va}}({\mathbf{X}} _{{4}},{\mathbf{X}} _6)). \end{aligned}$$

(8)

To make the pseudo-output contain both the deep feature and the shallow feature, we duplicate \({{\mathbf{Y }}_{3}}\) as

$$\begin{aligned} \{ {\mathbf{X }}_{7},{\mathbf{X }}_{8}\} = {\mathrm{cp}}({{\mathbf{Y }}_{3}}), \end{aligned}$$

(9)

and fuse them with the \({\mathbf{X} }_{i\times 2+1},i \in \{1,2\}\), which come from the shortcut connections, calculated by

$$\begin{aligned} {\mathrm{va}}({\mathbf{X}} _{i\times 2+1},{\mathbf{X}} _{i+6}) = {1 /2} \times {\mathbf{X}} _{i\times 2+1} + {1 /2} \times {\mathbf{X}} _{i+6}. \end{aligned}$$

(10)

Herein, we get two sets of pseudo-outputs with hybrid features. One maintains the shallow feature obtained by the \(3 \times 3\) convolution kernel and also contains the more attentive and deep feature obtained by the \(5\times 5\) convolution kernel. Conversely, the other maintains the shallow feature obtained by the \(5\times 5\) convolution kernel and contains the more attentive and deep feature obtained by the \(3\times 3\) convolution kernel.

In step 3, the features of the pseudo-outputs \({\mathrm{va}}({\mathbf{X}} _{i\times 2+1},{\mathbf{X}} _{i+6}), i \in \{1,2\}\) are extracted by two convolution modules with a \(5\times 5\) convolution kernel and a \(3\times 3\) convolution kernel, respectively. The pseudo-output \({{\mathbf{Y }}_{i+3}}, i \in \{1,2\}\) is calculated by

$$\begin{aligned} {\mathbf{Y }}_{i+3} = {H_{i+3,{\mathrm{size}}}}({\mathrm{va}}({{\mathbf{X}} _{i\times 2+1}},{\mathbf{X}} _{i+6})). \end{aligned}$$

(11)

Then, the \({{\mathbf{Y} }_{i+3}}, i \in \{1,2\}\) are merged again by the \({\mathrm{va}}(\cdot )\) function and the sixth convolution module with a \(1\times 1\) convolution kernel, expressed as

$$\begin{aligned} {\mathbf{Y }}_{6} = {H_{6,1}}({\mathrm{va}}({{\mathbf{Y}} _4},{\mathbf{Y}} _5)). \end{aligned}$$

(12)

Notice that \({\mathbf{Y }}_{6}\) is the final output of the Bi-ResNet unit block and, compared with the input data \({{\bar{\mathbf{S }}}_{k,d}}\), its dimension is reduced by 6, i.e., from \(168 \times 168\) to \(162 \times 162\).
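The data flow of the three steps can be summarized in a short wiring sketch. It only reproduces the order of the copy and vector addition operations in Eqs. (4) to (12). The six convolution modules are passed in as placeholder callables `H[1]` to `H[6]`, which are assumed to return feature maps of matching size wherever two maps are fused. The function name `unit_block` is hypothetical.

```python
def cp(x):
    """Copy function: return two identical copies of a feature map."""
    return [row[:] for row in x], [row[:] for row in x]


def va(a, b):
    """Vector addition with equal weights: 1/2 * a + 1/2 * b, element-wise."""
    return [[0.5 * p + 0.5 * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def unit_block(s, H):
    """Wire one unit block; H maps module index 1..6 to a placeholder callable."""
    # Step 1: duplicate the input and apply the two first-stage modules.
    x1, x2 = cp(s)                  # Eq. (4)
    y1, y2 = H[1](x1), H[2](x2)     # Eq. (5)
    # Step 2: keep one copy of each output as a shortcut, fuse the other copies.
    x3, x4 = cp(y1)                 # Eq. (6), i = 1
    x5, x6 = cp(y2)                 # Eq. (6), i = 2
    y3 = H[3](va(x4, x6))           # Eqs. (7) and (8)
    x7, x8 = cp(y3)                 # Eq. (9)
    # Step 3: fuse each shortcut with a copy of Y3, then merge both branches.
    y4 = H[4](va(x3, x7))           # Eqs. (10) and (11), i = 1
    y5 = H[5](va(x5, x8))           # Eqs. (10) and (11), i = 2
    return H[6](va(y4, y5))         # Eq. (12)


# Identity placeholders make the wiring easy to verify on a toy input.
identity_modules = {i: (lambda x: x) for i in range(1, 7)}
out = unit_block([[1.0, 2.0], [3.0, 4.0]], identity_modules)
print(out)
```

With identity placeholders every fusion averages two equal maps, so the block returns its input unchanged. Replacing a placeholder with a real module changes only the corresponding branch.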

### 3.2 Framework of Bi-ResNet

After a series of mixed and cross-learning steps, the final hybrid feature contains various deep and shallow features with different levels of attention, so that the vanishing gradient is avoided and deep features are obtained. On this basis, the framework of Bi-ResNet is developed, as shown in Fig. 4.

Considering that the input data of the detection model are the hourly electricity data within one week, we design Bi-ResNet to contain a \(3\times 3\) convolution kernel, 5 Bi-ResNet blocks, 4 maxpooling layers with stride 2, 1 maxpooling layer with stride 1, and some traditional components of CNN [25]. Specifically, because shallow features are important but relatively easy to extract, we first apply a convolution kernel of smaller size, i.e., \(3\times 3\), and a Bi-ResNet block is unnecessary for this step. Then, we use Bi-ResNet blocks and maxpooling layers to extract deep features. Notice that when the input data are processed by a Bi-ResNet block, a maxpooling layer with stride 2 and a maxpooling layer with stride 1, the size of the output data is reduced by 4, halved, and reduced by 1, respectively. Therefore, hourly electricity data with the size of \(168\times 168\) are fed into the Bi-ResNet, and output data with the size of \(6\times 6\) are obtained after maxpooling 5. Subsequently, the fully connected layer determines the type of the hourly electricity data, and the classification layer distinguishes whether the user steals electricity.
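The size rules stated above can be traced numerically. The sketch below is only one possible reading: the layer ordering in `ASSUMED_LAYERS` is hypothetical (Fig. 4 defines the actual order), the initial \(3\times 3\) convolution is assumed to be unpadded, and stride-2 pooling is assumed to round odd sizes down.

```python
def trace_sizes(size, layers):
    """Return the feature-map side length before the first and after each layer."""
    rules = {
        "conv3": lambda n: n - 2,     # assumed unpadded 3x3 convolution
        "block": lambda n: n - 4,     # Bi-ResNet block, reduction stated in the text
        "pool_s2": lambda n: n // 2,  # stride-2 max pooling, floor assumed
        "pool_s1": lambda n: n - 1,   # stride-1 max pooling
    }
    sizes = [size]
    for name in layers:
        size = rules[name](size)
        sizes.append(size)
    return sizes


# Hypothetical ordering: 1 convolution, 5 blocks, 1 stride-1 and 4 stride-2 poolings.
ASSUMED_LAYERS = [
    "conv3",
    "block", "pool_s1",
    "block", "pool_s2",
    "block", "pool_s2",
    "block", "pool_s2",
    "block", "pool_s2",
]

sizes = trace_sizes(168, ASSUMED_LAYERS)
print(sizes)
```

Under these assumptions the trace ends at a side length of 6 after the fifth pooling layer, matching the \(6\times 6\) output stated above. Other orderings of the same layers can reach the same final size, so the trace does not identify the ordering uniquely.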