Transfer restless multi-armed bandit policy for energy-efficient heterogeneous cellular network

Modi, Navikkumar; Mary, Philippe; Moy, Christophe

doi:10.1186/s13634-019-0637-1

Research
Open access
Published: 21 October 2019


Navikkumar Modi¹,
Philippe Mary² &
Christophe Moy³

EURASIP Journal on Advances in Signal Processing volume 2019, Article number: 46 (2019)


Abstract

This paper proposes a learning policy to improve the energy efficiency (EE) of heterogeneous cellular networks. Identifying the combination of active and inactive base stations (BS) that maximizes EE is a combinatorial learning problem that entails high computational complexity as well as a large signaling overhead. This paper presents a learning policy that dynamically switches a BS to ON or OFF status in order to follow the traffic load variation during the day. The network traffic load is represented as a Markov decision process, and we propose a modified upper confidence bound algorithm based on the restless Markov multi-armed bandit framework for the BS switching operation. Moreover, to cope with the initial reward loss and to speed up the convergence of the learning algorithm, the transfer learning concept is adapted to our algorithm in order to benefit from the knowledge observed in historical periods in the same region. Based on our previous work, a convergence theorem is provided for the proposed policy. Extensive simulations demonstrate that the proposed algorithms follow the traffic load variation during the day and contribute to a performance jump-start in EE improvement under various practical traffic load profiles. They also demonstrate that the proposed schemes can significantly reduce the total energy consumption of a cellular network, e.g., up to 70% potential energy savings based on a real traffic profile.

1 Introduction

The increasing popularity of portable smart devices has driven up traffic demand on radio access networks and has caused massive energy consumption, which depletes energy resources and may increase CO₂ emissions. Data centers, back-haul routers, and cellular access networks are the main sources of energy consumption in the information and communication technology industry, which is equivalent to 2 to 10% of the global overall power consumption of human activity [2]. In cellular networks, the energy consumption of base stations (BS) is about 60 to 80% of the overall power consumption of the network [3]. Besides, cellular network operators need to spend more than 10 billion dollars to meet the current energy consumption of cellular networks [4, 5]; thus, there is both environmental and strong economic pressure on operators to take the energy efficiency of the network into account. The main reason for such high energy consumption is that BS, and more generally cellular networks, are designed on a peak traffic load basis.

In fact, due to the traffic load variation in the time domain and the dynamic distribution of cellular users among cells in the space domain, there are opportunities for some BS to be put in sleep mode in order to achieve higher energy efficiency (EE). The side BS components, controller, and air conditioner are the main sources of energy consumption, rather than the transmit power, which accounts for only 3.1% of the BS power consumption [6]. Recent studies on real temporal traffic have shown that BS are largely underutilized, e.g., the traffic load can be below 10% of the peak load during 30% and 45% of the day on weekdays and weekends, respectively [7]. Thus, instead of just turning off radio transceivers, BS operators may prefer to turn off the underutilized BS and transfer the imposed traffic loads to neighboring active BS during low-traffic periods such as night time and/or weekends, which reduces the energy consumption [4].

Recently, there has been rising interest in switching BS ON and OFF according to the traffic load; however, it is essential for network operators to guarantee radio coverage and quality of service (QoS) to cellular users. Dynamically switching the BS operation mode between ON and OFF with respect to traffic load fluctuation is considered one of the effective methods to reduce the total energy consumption of a cellular network while maintaining good QoS. Moreover, the decision to switch the BS operation mode cannot be made individually at each BS, since it depends not only on the load of the cell of interest but also on the load of its neighbors. For instance, a BS may not be put in sleep mode while its neighboring BS are overloaded, even if its own traffic load is very low. EE maximization with BS switching operation is a well-known combinatorial problem. In machine learning, combinatorial problems are mostly addressed with a centralized decision made by a central controller that takes global information into account, i.e., channel state information and traffic load information.

In this work, the best BS deployment is learnt in order to maximize the network EE under QoS constraints by switching some BS ON or OFF. The EE maximization problem is tackled with the multi-armed bandit (MAB) approach, where arms are represented by the deployment configurations. MAB is a class of sequential decision-making problems where, given a set of arms, a user selects an arm at each slot in order to collect some reward. The most important property of the MAB paradigm is that a player does not need prior information about each arm's reward distribution, which makes MAB an interesting solution for the EE maximization problem. In this paper, we focus on a restless upper confidence bound policy which has been proven to be efficient for the opportunistic spectrum access (OSA) problem [8–10], where selecting an arm leads to two different rewards associated with it. The algorithm we proposed in [10], i.e., the RQoS-UCB policy, is adapted to the problem of finding the optimal BS configuration that maximizes the observed energy efficiency in the long run. Moreover, the transfer learning (TL) concept [11, 12] is applied in our context, where the temporal dependence in the traffic load between two days is exploited in order to increase the convergence speed of the current learning.

1.1 Related work

Recently, there has been a substantial body of work on traffic load-aware BS adaptation, and the authors in [13, 14] have validated the possibility of improving EE and have shown the energy saving gains by simulations. In [15–17], the authors proposed to dynamically adjust the sleeping status of BS, depending on the learnt and predicted traffic load of the network. The works in [18, 19] introduced some BS switching strategies for dynamic BS operations depending on daily traffic variation. However, reliable prediction of BS traffic load is still an important challenge for network operators, which limits the usefulness of these strategies in practical applications. An alternative energy-efficient procedure is the relay station switching technique employed in [20], where certain BS are turned to sleep mode and switched to a low-powered relay station mode during low-traffic intervals. In contrast, the authors in [21] introduced reinforcement learning (RL) algorithms for dynamic BS switching operation; however, these algorithms are highly dependent on a priori knowledge of the traffic load.

As stated in [12, 22, 23], the problem of EE maximization with BS switching operation is a combinatorial problem, and it has been proven to be NP-hard. Instead of directly addressing this problem, the authors in [24, 25] adopted fixed BS switching patterns and then evaluated the call blocking probability and the outage probability. In [4, 17, 26], some greedy algorithms have been introduced to tackle BS switching operation without presenting sufficient theoretical guarantees of convergence to the optimal configuration. The authors in [26] put forward a greedy algorithm to handle the trade-off between the energy consumption and the revenue in heterogeneous cellular networks. Then, [12, 16] used a Markov decision process (MDP) to model the traffic load prediction and used an RL approach, named the actor-critic algorithm [27], to predict the traffic load of the network without prior information about it. Moreover, the authors in [12] extended the actor-critic algorithm with the TL concept [11], leading to the transfer actor-critic (TACT) algorithm, in order to use the knowledge acquired in previous learning phases. Actor-critic-based algorithms provide good performance but are generally more complex than upper confidence bound (UCB) algorithms. Moreover, the existing works on the BS energy saving problem often lack a theoretical analysis of the convergence. In contrast, decentralized schemes for dynamic BS switching operation [28–31] are beneficial in that they do not require a central controller, but they demand more information exchange. However, none of the existing decentralized schemes presents a theoretical analysis of the convergence, which makes them less appealing from a theoretical point of view.

We assert that problems such as channel allocation in dynamic spectrum access can be of the same nature as the problem of base station switching, i.e., restless MAB. As a consequence, works dealing with learning strategies for the OSA scenario, for instance, can be related to our approach. In [32], the authors tackled the problem of MAB with Markovian rewards, which can be applied to OSA or to base station switching for green networking. In the OSA scenario, the most commonly addressed optimization problem is to find the band with the highest probability of being vacant. In that case, rewards are generally modeled as binary, but some works have also dealt with continuous rewards that rate the quality of the bands, for instance. The authors of [33, 34] considered the problem of finding the channel that gives the best data rate when data rates on channels are drawn from a Markovian distribution. But in these works, channel quality and availability have never been considered separately. In our previous works [8, 10], we proposed a new restless upper confidence bound policy for Markovian settings in the OSA problem. The proposed scheme, named restless quality of service upper confidence bound (RQoS-UCB), allows the radio to learn about the spectrum opportunities, i.e., bands that are less used by a primary network for instance, and also about a quality indicator of the bands that have been identified as unoccupied. The proposed scheme has been proven to have a logarithmic regret, which is the best behavior a learning policy may have, and its ability to converge toward the best band, in terms of both availability and quality, has been shown.

1.2 Contributions

This paper tackles the problem of EE maximization from the restless MAB framework, considering varying traffic load. The paper includes the following contributions:

This paper adapts the UCB policy in [10] to fit with the EE maximization problem. In particular, the state reward is fed-back when the BS configuration fulfills a set of constraints of the EE maximization problem. The soft reward is matched with the energy efficiency of the network.
The proposed algorithm includes a transfer learning stage in order to speed up the convergence toward the best deployment configuration.

To the best of our knowledge, MAB-UCB has never been applied to the dynamic configuration by switching ON and OFF BS in order to maximize the network EE. Our algorithm that learns on the state of the network and on the energy efficiency is proven to be efficient to solve the green networking problem.

1.3 Paper structure

The remainder of the paper is organized as follows. Section 2 introduces the system description and the EE maximization problem formulation. In Section 3, the traffic load variation is formulated as an MDP and the EE maximization algorithm, energy efficiency maximization-upper confidence bound (EEM-UCB), is presented. Moreover, the TL concept is embedded in the proposed EEM-UCB algorithm to form the TLEEM-UCB algorithm. Section 4 numerically evaluates the proposed schemes, compares them with state-of-the-art methods, and demonstrates their validity and effectiveness. Finally, Section 5 concludes this paper and outlines directions for future research.

2 Methods and problem formulation

Beforehand, Table 1 summarizes the notations in this paper.

Table 1 List of the main symbols in the paper


2.1 Network model

In this work, we consider a heterogeneous wireless cellular network comprising a mixture of macro and small cells, each governed by a macro or micro BS, respectively, where the set of BS $\mathcal {Y}=\{1,2,\cdots,Y\}$ lies in a two-dimensional area in $\mathbb {R}^{2}$. In addition, we assume that there exists a central controller, which knows the traffic load in the network in a timely manner at each instant and can predict the energy efficiency of BS at the next stage. Let us assume that all BS operate in an open access mode, i.e., any mobile station (MS) is allowed to connect to any BS, whether it belongs to the micro or macro tier [25]. We focus on the downlink communication, as mostly considered for mobile Internet applications. The network area is divided according to the Voronoi tessellation with BS acting as seeds for each cell. Each cell coverage in the wireless cellular network is denoted as $\mathcal {I}_{k}(n), k = 0, 1,2,\cdots $ at time slot n. At a given time slot, the set of active BS, denoted as $\mathcal {Y}^{\text {on}}$, defines a partition of the space. As the set of active BS changes from one time instant to another, MSs always connect to the nearest BS, micro or macro, as explained in Section 2.1.2. Each configuration of $\mathcal {Y}^{\text {on}}$ leads to a certain rate and energy consumption, whose computation is detailed in the following, and we aim at finding the configuration that maximizes the energy efficiency while guaranteeing a minimum data rate to all users.

2.1.1 Traffic profile

Let $x_{k} \in \mathcal {I}_{k}(n)$ be the two-dimensional Cartesian coordinates denoting the location of an MS in the coverage of the kth BS at time slot n. An MS is referred to as active when it is receiving a call. When the call ends, the MS becomes inactive and departs from the network. The traffic load of a BS is measured in terms of the number of active MSs and their respective call durations.

At each time slot n, new and handover calls at x_k arrive according to a Poisson point process with arrival rate per time unit Λ(x_k,n). The associated call duration (or file size) is assumed to be exponentially distributed with mean 1/h(x_k,n). Then, the instantaneous traffic load at location x_k can be expressed as $L(x_{k},n) = \frac {\Lambda (x_{k},n)}{h(x_{k},n)}$ at time slot n [29]. By setting different arrival rates or call holding times for MSs located in different cells, this model can capture temporal and spatial traffic variability. Thus, when the set of BS $\mathcal {Y}_{n}^{on}$ is switched ON at time slot n, the instantaneous traffic load served by BS $k \in \mathcal {Y}_{n}^{on}$ can be expressed as:

$$ L_{k}(n) = \sum\limits_{x_{k} \in \mathcal{I}_{k}(n)} \frac{\Lambda(x_{k},n)}{h(x_{k},n)}. $$

Conversely, when BS k is turned OFF, the instantaneous traffic load served by BS k is defined as zero, i.e., L_k(n)=0. The total arrival process of a BS k is the superposition of all Poisson arrivals at different locations in $\mathcal {I}_{k}$, which again forms a Poisson process [35]. Moreover, the daily traffic profile of the whole cellular network repeats periodically, as recorded by several works [7, 20]. This model will be useful when considering the performance of the learning algorithms during the day.

2.1.2 BS selection rule

An MS is assumed to connect to the nearest BS, in order to suffer from the least path loss during the wireless transmission. An active MS located at x_k is connected to and served by the BS $k, k\in \mathcal {Y}^{on}_{n}$, which presents the best received signal strength at each time slot n (see Footnote 1), where $\mathcal {Y}^{on}_{n}$ is the set of active BS at instant n.

2.1.3 Channel model

The service rate of an active MS at location x_k provided by the kth BS at the nth time slot is assumed to be equal to the Shannon capacity:

$$\begin{array}{@{}rcl@{}} \Theta_{k}(x_{k},n) = B_{a} \cdot \log_{2}\left(1 + \text{SINR}_{k}(x_{k}, n)\right) \end{array} $$

(1)

where B_a denotes the system bandwidth, SINR_k(x_k,n) is the received signal to interference plus noise ratio (SINR) at x_k from BS k at the nth time slot, and is defined as

$$ \text{SINR}_{k}(x_{k}, n) = \frac{g_{k,x_{k}}(n) P^{tx}_{k}}{\phi g_{k,x_{k}}(n) P^{tx}_{k} + \sum\limits_{m \in \mathcal{Y}^{on}_{n} \backslash \{k\}} g_{m,x_{k}}(n) P^{tx}_{m} + \sigma^{2}} $$

(2)

where $g_{k,x_{k}}(n)$ is the average channel gain from BS k to active MS at location x_k at the nth time slot. The channel gain only comprises path loss in this paper, but log-normal shadowing and fading can be taken into account easily without changing the principle of the learning policy that will be introduced in the next section. Moreover, $P^{tx}_{k}$ is the transmission power of BS k, σ² is the noise power, and $\sum \limits _{m \in \mathcal {Y}^{on}_{n} \backslash \{k\}} g_{m,x_{k}}(n) P^{tx}_{m}$ is the interference power experienced by MS x_k from its neighboring BS at the nth time slot. The parameter ϕ is the orthogonality (or self interference) factor, $\phi \in \left [0,1\right ]$, and $\phi g_{k,x_{k}}(n) P^{tx}_{k}$ models intra-cell interference [38].
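Equations (1) and (2) can be illustrated with a minimal Python sketch. The function names `sinr` and `service_rate`, and the representation of the interfering BS as a list of (gain, transmit power) pairs, are assumptions made for illustration only; they are not taken from the paper.

```python
import math


def sinr(gain_serving, p_tx_serving, interferers, noise_power, phi=0.0):
    """Eq. (2): SINR seen at one location from its serving BS.

    interferers: list of (channel_gain, tx_power) pairs, one per other
    active BS (hypothetical data layout).
    phi: orthogonality factor in [0, 1] modelling intra-cell interference.
    """
    signal = gain_serving * p_tx_serving
    intra_cell = phi * signal
    inter_cell = sum(g * p for g, p in interferers)
    return signal / (intra_cell + inter_cell + noise_power)


def service_rate(bandwidth, sinr_value):
    """Eq. (1): Shannon capacity for a given bandwidth and SINR."""
    return bandwidth * math.log2(1.0 + sinr_value)
```

With all units chosen consistently (linear gains, watts, hertz), `service_rate` returns a rate in bits per second.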

2.1.4 System load

In order to satisfy the QoS requirement of MSs, a BS should provide a certain amount of resources (e.g., time or frequency) in order to absorb the MS traffic load and provide enough service rate to users. From the system’s perspective, the system load of BS k at the nth time slot is estimated as the fraction of resource to serve the total traffic load in its coverage [29]

$$\begin{array}{@{}rcl@{}} \rho_{k}(n) = \sum\limits_{x_{k} \in \mathcal{I}_{k}(n)} \frac{L(x_{k},n)}{\Theta_{k}(x_{k},n)}. \end{array} $$

(3)

The system load denotes the fraction of time required to serve the total traffic load in the coverage of the kth BS. Eventually, our main goal is to choose the set of active BS that maximizes the global network energy efficiency without prior knowledge of the traffic load statistics. We will give the details in Section 3.
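As a rough illustration of the per-location traffic load L(x_k,n) = Λ(x_k,n)/h(x_k,n) and of the system load in (3), the sketch below sums both quantities over the locations served by one BS. The tuple layout and the function names are hypothetical.

```python
def traffic_load(arrival_rate, h):
    """L(x, n) = Lambda(x, n) / h(x, n) for a single location."""
    return arrival_rate / h


def cell_traffic_load(locations):
    """L_k(n): sum of the traffic loads over the locations of one cell.

    locations: iterable of (arrival_rate, h, service_rate) tuples
    (hypothetical layout).
    """
    return sum(traffic_load(lam, h) for lam, h, _ in locations)


def system_load(locations):
    """Eq. (3): rho_k(n), the fraction of resource needed to serve the cell."""
    return sum(traffic_load(lam, h) / theta for lam, h, theta in locations)
```

A value of `system_load` above 1 would mean that the BS cannot absorb the offered traffic with the rates it provides.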

2.1.5 Power consumption model

The total power consumed $P^{k}_{T}(n)$ by each BS k at the nth time slot can be expressed as [39]:

$$\begin{array}{@{}rcl@{}} P^{k}_{T}(n) = a_{k} P^{tx}_{k}(n) + P^{k}_{f} \end{array} $$

(4)

where $P^{tx}_{k}(n)$ denotes the transmission power of BS k at the nth time slot and $P^{k}_{f}$ denotes the static power consumption independent of $P^{tx}_{k}(n)$ and includes all electronic circuit power dissipation due to site cooling, signal processing hardware, and battery backup systems. a_k is a BS power scaling factor which reflects both amplifier and feeder losses.

2.2 Problem formulation

The energy efficiency of a cell k, in bits per joule, at instant n is the ratio of the data sum-rate of the cell to the power used to run the cell. The network EE is then the aggregate EE of each cell and can be expressed as [40]:

$$\begin{array}{@{}rcl@{}} EE(n) = \sum\limits_{k \in \mathcal{Y}^{on}_{n}}\frac{\sum\limits_{x_{k} \in \mathcal{I}_{k}(n)}\Theta_{k}(x_{k},n)}{P^{k}_{T}(n)} \end{array} $$

(5)

EE maximization in the cellular network, without a power allocation strategy, reduces to finding the set of active BS that maximizes (5). The problem can be formally written as

$$\begin{array}{*{20}l} & \mathcal{Y}^{on^{*}}_{n} = \arg \max_{\mathcal{Y}^{on}_{n}} \left[\sum\limits_{k \in \mathcal{Y}^{on}_{n}}\frac{\sum\limits_{x_{k} \in \mathcal{I}_{k}(n)}\Theta_{k}(x_{k},n)}{P^{k}_{T}(n)}\right] \\ s.t.&\hspace{0.5cm} 0 \leq \rho_{k}(n) \leq \rho_{th}, \forall k \in \mathcal{Y}^{on}_{n} \\ &\hspace{0.5cm} \Theta_{k}(x_{k},n) \geq \Theta^{\min}, \forall x_{k} \in \mathcal{I}_{k}(n), \forall k \in \mathcal{Y}^{on}_{n} \\ &\hspace{0.5cm} \mathcal{Y}^{on}_{n} \neq \emptyset \end{array} $$

(6a) (6b) (6c) (6d)

As in [12, 29], a system load threshold ρ_th≤1 is introduced as a constraint, (6b), in order to keep the system stable. Indeed, the service rate of a user, i.e., $\Theta _{k}\left (x_{k},n\right)$, should be sufficient to absorb the traffic load at x_k. If not, some transmissions may be delayed, which would have to be taken into account in the model and is out of the scope of this paper. For instance, a low threshold value ρ_th indicates that BS operate in a more conservative manner, with low delay and low call dropping probability for MSs since all calls can be routed to users. On the contrary, with a high threshold value ρ_th close to 1, the data rate of users is just enough to avoid overflow, implying a limited power consumption but an increasing call dropping probability. The constraint (6c) guarantees a minimum data rate $\Theta ^{\min }$ to each active user, and the constraint (6d) states that there is at least one active BS.
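The power model (4), the network EE (5), and the constraints (6b) to (6d) can be sketched as follows. This is only an illustrative sketch: the dictionary keys `rates`, `power`, and `load`, as well as the function names, are assumptions and do not come from the paper.

```python
def bs_power(a_k, p_tx, p_fixed):
    """Eq. (4): total power of one BS (scaled transmit power plus static power)."""
    return a_k * p_tx + p_fixed


def network_ee(active_cells):
    """Eq. (5): sum over active BS of (cell sum-rate / total BS power).

    active_cells: list of dicts with keys 'rates' (per-user service rates),
    'power' (total BS power) and 'load' (system load), a hypothetical layout.
    """
    return sum(sum(cell["rates"]) / cell["power"] for cell in active_cells)


def constraints_satisfied(active_cells, rho_th, theta_min):
    """Check constraints (6b), (6c) and (6d) for one configuration."""
    if not active_cells:  # (6d): at least one BS must be active
        return False
    for cell in active_cells:
        if not 0.0 <= cell["load"] <= rho_th:  # (6b): load threshold
            return False
        if any(rate < theta_min for rate in cell["rates"]):  # (6c): minimum rate
            return False
    return True
```

A configuration is then a candidate solution of (6a) only when `constraints_satisfied` returns `True`.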

The above problem can be proven to be NP-complete by reducing from a vertex cover problem [22, 29]. Finding the set of active BS maximizing network EE by an exhaustive search is very costly in computational resources since 2^Y−1 ON/OFF combinations have to be tried, specifically when the number of BS is large. This problem can rather be tackled under MAB approach where a specific combination is tried at each iteration and a reward (EE of the system) is collected. In the next section, we will show how this principle can lead to a good state.
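To make the cost of the exhaustive search concrete, the sketch below enumerates all 2^Y−1 non-empty ON/OFF combinations. The callable `evaluate` is a hypothetical stand-in for whatever computes the EE and the feasibility of a configuration; it is not part of the paper.

```python
from itertools import product


def exhaustive_search(num_bs, evaluate):
    """Try every non-empty ON/OFF configuration and keep the feasible one
    with the highest EE.

    evaluate: hypothetical callable mapping a tuple of 0/1 flags to a pair
    (ee, feasible).
    """
    best_config, best_ee = None, float("-inf")
    for config in product((0, 1), repeat=num_bs):
        if not any(config):
            continue  # the all-OFF configuration violates (6d)
        ee, feasible = evaluate(config)
        if feasible and ee > best_ee:
            best_config, best_ee = config, ee
    return best_config, best_ee
```

The loop body runs 2^Y−1 times, so the number of evaluations roughly doubles with each additional BS, which is what motivates trying one configuration per iteration under the MAB approach instead.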

3 RL for energy-efficient network

3.1 System model

The dynamic BS switching problem is modeled as an MAB under Markovian settings. Figure 1 illustrates the principle of the learning policy with the MAB approach, where each arm represents a different configuration of BS activity. We define an MDP for BS switching operation as a tuple $\mathcal {M}=<\mathcal {S},\mathcal {K},\mathcal {P},R>$, where $\mathcal {S}$ denotes the state space, $\mathcal {K}$ denotes the action space, $\mathcal {P}$ denotes a state transition probability matrix, and finally R is a reward function associated with $\mathcal {S}$, $\mathcal {K}$, and $\mathcal {P}$. At each iteration, the controller chooses an action i among $\left |\mathcal {K}\right | = 2^{Y}-1$ possible actions, i.e., $\mathbf {a}^{i}(n) = \left [a^{i}_{1}(n), \cdots, a^{i}_{Y}(n)\right ]$ where $a_{k}^{i}(n) = 1$ if BS k is switched ON in action number i at time n and 0 otherwise. This action leads the network to a given state $S^{i}(n)\in \left \{0,1\right \}$, where Sⁱ(n)=1 if all constraints from (6b) to (6d) are satisfied and 0 otherwise.
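One possible way to index the 2^Y−1 actions is to read the action number i as a binary word, as in the sketch below. This encoding is an assumption made for illustration; the paper does not specify how actions are numbered.

```python
def action_vector(i, num_bs):
    """Map an action number i in {1, ..., 2**num_bs - 1} to the ON/OFF vector
    a^i, where entry k is 1 if BS k+1 is ON (hypothetical binary encoding).
    """
    if not 1 <= i <= 2 ** num_bs - 1:
        raise ValueError("action number out of range")
    return [(i >> k) & 1 for k in range(num_bs)]
```

Because i starts at 1, the all-OFF vector is never produced, which matches constraint (6d).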

Due to the random process governing the time evolution of the traffic load, the state Sⁱ(n) transforms into Sⁱ(n+1) at the next time instant according to a transition probability measure for arm i, i.e., $\mathcal {P}^{i} = \left \{P^{i}_{k,l}, k,l \in \mathcal {S}, i \in \mathcal {K}\right \}$. Moreover, the Markovian process is considered stationary over a short time period, e.g., 1 h, and hence, the distribution of this MDP is such that $\pi ^{i}_{S}(n) = \pi ^{i}_{S}, \forall n$. The reward achieved in state Sⁱ from the BS switching operation i after n time slots is, without loss of generality, equal to the value of the state, Sⁱ(n), i.e., 0 or 1. In addition, we consider that the network EE achieved by switching BS status according to the action number i is the second reward, i.e., $R^{i}_{S}(n) = EE(n)$, computed from (5) for a given environment state Sⁱ(n). The reward on EE and the state are fed back to the controller in order to decide the next action to take. The mean reward μⁱ associated with BS switching operation i under the stationary distribution $\boldsymbol {\pi }^{i}_{S}$ is given by: $\mu ^{i}=\sum _{S \in \mathcal {S}} S^{i} G^{i}_{S} \pi ^{i}_{S}$, where

$$ G_{S}^{i} \left(T^{i}(n)\right)= \frac{1}{T^{i}(n)} \sum_{p = 1}^{T^{i}(n)} R_{S}^{i}(p) $$

(7)

where Tⁱ(n) refers to the number of times the BS switching operation i has been performed by the controller up to time n. The policy $\mathcal {A}$ is a mapping that selects a BS switching operation i at each time slot n:

$$\begin{array}{rcl} \mathcal{A}: \mathbb{N} & \longrightarrow & \mathcal{K} \\ n & \longmapsto & i \end{array} $$

The goal of an RL policy is to minimize its regret in the long run, i.e.,

$$\begin{array}{@{}rcl@{}} \Phi^{\mathcal{A}}(n) = n\mu^{*} - \mathbb{E}\left[ \sum_{t=1}^{n} S^{\mathcal{A}(t)} (t) G^{\mathcal{A}(t)}_{S^{\mathcal{A}(t)}} (t) \right] \end{array} $$

(8)

where the expectation $\mathbb {E}$ is taken over the states and observed reward. Let $S^{\mathcal {A}(t)}$ be the state observed by using the policy $\mathcal {A}$ at time slot t. Moreover, μ^∗ is the optimal mean reward obtained by always selecting the best action at each time t.
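The mean reward μⁱ and the regret in (8) can be illustrated with the simplified sketch below. Two simplifications are assumed here and are not from the paper: the expectation in (8) is replaced by a single observed sample path, and the collected reward at each slot is taken as the product of the observed state and the instantaneous EE.

```python
def mean_reward(stationary, mean_ee):
    """mu^i = sum over states S of S * G_S^i * pi_S^i.

    stationary, mean_ee: dicts keyed by the state value (0 or 1), a
    hypothetical layout.
    """
    return sum(s * mean_ee[s] * stationary[s] for s in stationary)


def empirical_regret(mu_star, states, ee_rewards):
    """Single-path stand-in for Eq. (8): n * mu_star minus collected reward."""
    n = len(states)
    collected = sum(s * r for s, r in zip(states, ee_rewards))
    return n * mu_star - collected
```

Since the state takes values in {0, 1}, only slots in which the constraints are satisfied contribute to the collected reward.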

3.2 Restless energy efficiency maximization - upper confidence bound (EEM-UCB)

In this section, we adopt the restless UCB policy, i.e., RQoS-UCB, that we proposed in [10] for learning the best bands in OSA context based on their probability to be free and the quality of the band. The principle of the algorithm is adapted to the current problem, where the policy aims at finding an optimal set of active BS which maximizes the energy efficiency of the network and will be named EEM-UCB.

When dealing with an MAB problem, one should first ask whether it belongs to the rested or the restless category. In the former category, the states of the Markov chains corresponding to arms that are not played do not evolve with time; only the Markov chain of the selected arm does. In the latter, the states of all Markov chains continue to evolve whether or not they are selected. Our problem fits the latter category. Indeed, the traffic request of users does not depend on the selected BS; however, the selected configuration definitely influences the traffic load of the network by distributing the data flow among BS. The EEM-UCB algorithm operates in a block structure, as represented in Fig. 2, which is based on regenerative cycles [41, 42]. Each block is divided into three sub-blocks, SB1, SB2, and SB3. For each arm i, a regenerative state ζⁱ is defined, i.e., 0 or 1 in our case, and SB1 comprises all time slots from the selection of configuration i to the first visit to the state ζⁱ. SB2 contains all time slots from the first visit to ζⁱ up to, but excluding, the second visit to ζⁱ. The last sub-block, SB3, consists only of the second visit to the state ζⁱ. The selection of the active BS set is based on an index computation Bⁱ(n) for configuration i and will be formally expressed in the next section. The computation of the index occurs after the completion of SB3. The reason for this structure lies in the restless nature of the arms, which evolve even if they are not played. The distribution of the state reward obtained by playing a given arm is a function of the time elapsed since the last time the same arm was played. In order to deal with a homogeneous Markov chain, the stay in a given arm should be sufficiently long to reconstruct a sample path with the same statistical characteristics as the Markov chain governing the arm [42]. It is worth noting, however, that this structure does not prevent rewards, state and EE, from being collected in any block. This sub-division is only needed for the mathematical convergence proof.
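The sub-block decomposition can be sketched on a recorded sequence of states for one arm. The function name `split_block` and the choice to exclude the first visit to ζⁱ from SB1 (so that it opens SB2) are assumptions of this sketch.

```python
def split_block(states, regen_state):
    """Split the states observed while playing one arm into (SB1, SB2, SB3).

    SB1: slots before the first visit to regen_state.
    SB2: slots from the first visit up to, but excluding, the second visit.
    SB3: the second visit only.
    Returns None if regen_state has not yet been visited twice.
    """
    try:
        first = states.index(regen_state)
        second = states.index(regen_state, first + 1)
    except ValueError:
        return None  # the block is not complete yet
    return states[:first], states[first:second], states[second:second + 1]
```

In this sketch, the arm keeps being played until `split_block` returns a result, at which point the index can be recomputed.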

At a given time slot n, policy $\mathcal {A}$ selects the BS switching operation that has the highest policy index $B^{i}\left (n,T^{i}(n)\right)$ at time n. This action may transform the current state Sⁱ(n) of the network to another state Sⁱ(n+1) with certain probability $\mathcal {P}$. The new reward $R^{i}_{S}(n)$, i.e., energy efficiency, is fed back to the controller. Then, the policy $\mathcal {A}$ updates the policy index $B^{i}\left (n+1,T^{i}(n+1)\right)$ with the empirical average on the state Sⁱ and the empirical mean of the energy efficiency experienced so far. The algorithm repeats the above procedure until convergence to optimal BS switching operation during each hour of operation. The formal description of the index computation is given in Section 3.3.

3.3 Transfer learning EEM-UCB (TLEEM-UCB) policy

The previous strategy may suffer from traffic load variation from one day to another at a given period of time, due to the variation of the Poisson arrival rate between two consecutive days. This would argue for learning from scratch each new day with the new, unknown statistics characterizing the underlying Markovian process. In that case, the network would lose time re-learning the best deployment configuration it learnt the day before at the same hour. Another strategy consists in using the knowledge the controller learnt during some historical periods to find the current optimal BS switching operation. This strategy makes even more sense since the Poisson arrival rate does not change much between two consecutive days, as we will see in Section 4.

The motivation for transfer learning is to reuse features previously learnt on a given task (source task) in order to speed up the learning phase of different features on a target task, as illustrated in Fig. 3. In other words, the controller uses the BS deployment learnt in a previous time period for the current task with its own statistical characteristics. We hence propose a new policy update method, named Transfer Learning EEM-UCB (TLEEM-UCB) policy, that is detailed in Algorithm 1. In the source policy, the reward achieved in state $S^{i,h}$ from a BS switching operation $i \in \mathcal {K}$ during $H_{2}^{i}$ time slots is $S^{i,h}\left (H_{2}^{i}\right)$, where $H_{2}^{i}$ is the number of time slots the BS switching operation i has been selected in the SB2 block in the source task, as recalled in Table 1. Meanwhile, the observed reward associated with energy efficiency by selecting a BS switching operation i is $R^{i,h}_{S}\left (H_{2}^{i}\right)$ during $H_{2}^{i}$ source task observations in SB2.

At the end of each block b, Algorithm 1 returns a BS switching operation index maximizing the policy index, $B^{i,h}(n_{2},T^{i})\ \forall \ i\in \mathcal {K}$, which has to be selected for the next block of operation, i.e., steps ?? and ??. The index computation is done according to three terms:

$$\begin{array}{@{}rcl@{}} B^{i,h}\!\left(n_{2},T^{i}_{2}\right) \,=\, \bar{S}^{i,h}\left(T^{i}_{2}\right) \,-\, Q^{i,h}\left(n_{2},T^{i}_{2}\right) \,+\, A^{i,h}{\left(n_{2},T^{i}_{2}\right)},\ \!\forall i \end{array} $$

(9)

where $\bar {S}^{i,h}(T^{i}_{2})$ is the empirical mean of the observed states obtained with action i over the time periods of both the source task and the current task. As recalled in Table 1, $T_{2}^{i}$ is the number of times action i has been selected in the SB2 block up to time n₂ in the current task. The empirical mean is expressed as

$$\begin{array}{@{}rcl@{}} \bar{S}^{i,h}\left(T^{i}_{2}\right) = \frac{\sum\limits_{t = 1}^{T^{i}_{2}}S^{i}(t) + \sum\limits_{t = 1}^{H_{2}^{i}} S^{i,h}(t)}{T^{i}_{2} + H_{2}^{i}}, \forall i. \end{array} $$

(10)

The second term, i.e., $Q^{i,h}\left (n_{2},T_{2}^{i}\right)$, is computed as in the RQoS-UCB policy [10] but includes source task observations:

$$\begin{array}{@{}rcl@{}} Q^{i,h}\left(n_{2},T^{i}_{2}\right) = \frac{\beta M^{i,h}\left(n_{2},T^{i}_{2}\right) \ln\left(n_{2} + H_{2}^{i}\right)}{T^{i}_{2} + H_{2}^{i}}, \forall i, \end{array} $$

(11)

where,

$$M^{i,h}\left(n_{2},T^{i}_{2}\right) = G^{S}_{\max} - G^{i,h}_{S}\left(T^{i}_{2}\right),\ \forall i, $$

and $G^{i,h}_{S}\left (T^{i}_{2}\right) = \frac {1}{T^{i}_{2}}\sum _{k=1}^{T^{i}_{2}} R^{i}_{S}(k) + \frac {1}{H_{2}^{i}}\sum _{k=1}^{H_{2}^{i}} R^{i,h}_{S}(k)$ denotes the empirical mean of EE reward, i.e., $R^{i}_{S}$, collected in the current task in SB2 block by applying action i in state S plus the total mean EE reward gathered in source task. Moreover, $G^{S}_{\max } = \max _{i \in \mathcal {K}} G^{i,h}_{S}\left (T^{i}_{2}\right)$ is the maximum reward within the set of BS switching operations from current and historical observations in state S. Finally, the bias term $A^{i,h}{\left (n_{2},T^{i}_{2}\right)}$ is defined as

$$\begin{array}{@{}rcl@{}} A^{i,h}{\left(n_{2},T^{i}_{2}\right)} = \sqrt{\frac{\alpha \ln\left(n_{2} + H_{2}^{i}\right)}{T^{i}_{2} + H_{2}^{i}}}, \hspace{0.2cm} \forall i. \end{array} $$

(12)
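As a rough illustration of how (9) to (12) combine, the following Python sketch evaluates the index of a single action. The function name `tleem_ucb_index` and its argument names are hypothetical and not taken from the paper; the mean rewards $G^{i,h}_{S}$ and $G^{S}_{\max}$ are assumed to be computed elsewhere and passed in.

```python
import math


def tleem_ucb_index(cur_states, hist_states, g_action, g_max, n2, alpha, beta):
    """Return B = S_bar - Q + A, following Eqs. (9)-(12), for one action.

    cur_states  : states observed in SB2 blocks of the current task (length T_2^i)
    hist_states : states observed in SB2 blocks of the source task (length H_2^i)
    g_action    : mean EE reward G_S^{i,h} of this action (assumed precomputed)
    g_max       : largest mean EE reward over all actions, G_max^S
    n2          : current time index n_2 (must satisfy n2 + H_2^i >= 1)
    """
    t2, h2 = len(cur_states), len(hist_states)
    plays = t2 + h2
    if plays == 0:
        raise ValueError("the action needs at least one observation")
    s_bar = (sum(cur_states) + sum(hist_states)) / plays   # Eq. (10)
    log_term = math.log(n2 + h2)
    m = g_max - g_action                                    # M^{i,h}
    q = beta * m * log_term / plays                         # Eq. (11)
    a = math.sqrt(alpha * log_term / plays)                 # Eq. (12)
    return s_bar - q + a                                    # Eq. (9)
```

With an empty `hist_states`, the function reduces to the index without transferred knowledge. The values α=0.25 and β=0.32 used in the simulations can be passed as `alpha` and `beta`.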

It is worth noting that there exists a class of bandit algorithms that use side information. This kind of bandit is sometimes called contextual bandit or bandit with feedback [43]. Expert systems described in [44] can also be seen as a generalization of learning with side observations. The main idea of bandits with side information is that, at each time instant and before taking a decision, the player is able to observe a realization of a random variable, or a linear function of it, called side information, in order to produce a next estimate closer to the true value being sought. Transfer learning, on the other hand, uses the previously computed index Bⁱ of each arm i to initialize the algorithm and achieve a jump-start in the convergence rate, which makes it a quite different approach from bandits with side information.

3.4 Convergence analysis of TLEEM-UCB

In this section, the total number of suboptimal plays is upper-bounded under the following Condition 1 on the arms.

Condition 1

All arms are finite-state, irreducible, aperiodic Markov chains whose transition probability matrices have irreducible multiplicative symmetrization, and the state of non-played arms may evolve.

Let us consider $G_{q}^{i} \geq \frac {1}{\hat {\pi }_{\max } + \pi ^{i}_{q}}$ and $ \beta \geq 84 S^{2}_{\max } r^{2}_{\max } G^{2}_{\max }\hat {\pi }^{2}_{\max }/\left (\epsilon _{\min } \Delta \mu _{i}^{R} M_{\min }\right)$. We present an upper bound on the total expected number of plays of suboptimal arms in Theorem 1.

Theorem 1

Assume that all arms satisfy Condition 1. Let $\pi _{\min } $, $\hat \pi _{\max } $, $S_{\max } $, $r_{\max }$, $\varepsilon _{\min }$, $M_{\min }$, Δμⁱ, and $\Omega ^{i}_{\max }$ be defined as in Table 1. The total expected number of plays of a suboptimal BS configuration is upper-bounded by:

$$ \begin{aligned} \mathbb{E}\left[T^{i,h}(n)\right] \leq \left(\frac{1}{\pi^{i}_{\min}} + \Omega^{i}_{\max} + 1\right) \left(l^{+} + \frac{4}{\pi_{\min}} \sum_{t=1}^{\infty} \left(t +H_{2}^{*}\right)^{-2} \right) \end{aligned} $$

where,

$$ l^{+} = \max\left(0, \frac{4 \alpha \ln {\left(n_{2} +H_{2}^{i}\right)}}{(\Delta \mu^{i})^{2}} - H_{2}^{i}\right) $$

(13)

Proof

A sketch of proof of Theorem 1 is provided in Appendix A and follows the same steps as in [10, Th. 1] considering transferred observations. □

Note that the above bound reduces to the bound of EEM-UCB policy, which is the bound of RQoS-UCB policy [10, Th. 1], when the transferred knowledge is not available (i.e., $H_{2}^{i} = 0, \forall i$).
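To make the role of the transferred samples in (13) concrete, the following Python sketch evaluates l⁺ for a given number of historical observations. The function name and the gap value `delta_mu = 0.1` used in the usage lines are hypothetical; only α=0.25 is taken from the simulation settings of the paper.

```python
import math


def exploration_term(n2, h2, alpha, delta_mu):
    """l^+ of Eq. (13): plays still needed once h2 transferred samples are counted."""
    if delta_mu <= 0:
        raise ValueError("delta_mu must be positive for a suboptimal action")
    return max(0.0, 4.0 * alpha * math.log(n2 + h2) / delta_mu ** 2 - h2)


# Hypothetical gap of 0.1; h2 = 0 corresponds to the case without transfer.
for h2 in (0, 100, 1000):
    print(h2, exploration_term(3000, h2, alpha=0.25, delta_mu=0.1))
```

With `h2 = 0` the expression is the logarithmic term alone, and it is clipped to zero as soon as the transferred samples exceed that term.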

3.5 Complexity and scalability issues

TLEEM-UCB is an index-based algorithm, and the complexity of the index computation is small. At each iteration in SB2, it requires the evaluation of (10), (11), and (12). Equation (10) is nothing but a moving average that only requires one addition and one division at each iteration, since the other values have been recorded during previous iterations. Equation (11) requires a log operation, two multiplications (one of them with a constant, which is less complex than a multiplication between two varying terms), one division, and a moving average for evaluating M at each iteration. Finally, the bias term (12) requires a multiplication (with a constant) of the log term already computed, a division, and a square root evaluation at each iteration. The low amount of computation and the long period between two iterations make this cost negligible compared to even the simplest signal processing operation performed at the PHY layer, for instance.

The algorithmic complexity of TLEEM-UCB is linear in the number of combinations, but the latter is exponential in the number of base stations, i.e., 2^Y − 1. However, this complexity is entirely concentrated in the initialization phase, where the algorithm explores all combinations once in order to give an index to each configuration. Once this has been done, only one BS configuration is tested for the index computation at a given time, hence with the computational complexity mentioned above. Moreover, thanks to transfer learning, the initialization phase does not need to be repeated each day, since the algorithm uses the best indexes previously learnt in historical periods to start the new learning phase. A large network will impact the convergence time of the algorithm, since the best configuration needs to be found in a set of larger cardinality, but it does not increase the computational complexity per iteration. The convergence time would be far too long for a network with 50 base stations, for instance. However, one can imagine having a learning algorithm that controls a cluster of a few base stations rather than the entire network. Coordination among clusters could be done at a higher level in the network, but this is beyond the scope of the paper. Finally, it is worth noting that the actor-critic algorithm in [12] and the decentralized greedy algorithm in [28] belong to the larger class of Q-learning algorithms, whose algorithmic complexity is significantly larger than the computation of an index in a UCB policy.
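The growth of the configuration set can be illustrated with a few lines of Python. This is only a sketch: it reads the count as 2^Y − 1, i.e., all ON/OFF combinations except the one with every BS OFF, and it assumes a hypothetical delay of 1 s per tested configuration; both the function names and that delay are illustrative.

```python
def num_configurations(num_bs):
    """Number of non-empty ON/OFF combinations of num_bs base stations, 2^Y - 1."""
    return 2 ** num_bs - 1


def initialization_hours(num_bs, seconds_per_test=1.0):
    """Time to try every configuration once, at an assumed seconds_per_test each."""
    return num_configurations(num_bs) * seconds_per_test / 3600.0


for y in (5, 10, 15, 50):
    print(y, num_configurations(y), initialization_hours(y))
```

The loop makes the motivation for clustering visible: the initialization time stays modest for around ten BS but becomes impractical well before 50 BS.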

The algorithm relies on the feedback of the energy consumption metric of each cell to the central entity. Base stations already record the data rate and the transmit power allocated to each user. By also monitoring its own power consumption, each base station can compute an estimate of its energy efficiency. The computed EE only occupies a small number of bytes when transmitted to the central controller, leading to a negligible amount of overhead on the fronthaul links. However, the time needed to put a decision into action is not equal to zero. Collecting measurements, computing a reward, and switching a set of BS on or off takes an amount of time that depends on the data used to build a statistic, e.g., the average consumed power, and on other technological constraints. These features are, of course, of great importance in a real deployment but require much more investigation, including implementation on a real platform, and are left for further work. We will see in the next section that the proposed algorithms converge after around 3000 iterations. If the time lag between the collection of data and the configuration change is 1 s, convergence occurs after roughly 1 h (3000 s, i.e., 50 min). However, the algorithm continuously performs the index computation according to the received frames in the network and never stops running, such that the base station configuration is continuously adapted during the day according to the traffic measured in the network.

4 Results and discussion

In this section, the performance of our proposed energy-efficient dynamic BS operation algorithm is investigated through extensive simulations under practical configurations similar to [4, 12, 29]. We consider a heterogeneous cellular network topology consisting of 5 macro and 5 micro BS arbitrarily deployed in an area of 5×5 km². Furthermore, the call arrival rate at location x^k follows a Poisson point process with intensity Λ(x^k,n), which may vary between the source and target tasks as summarized in Table 2, and the average file size of each call is 1/h(x^k,n)=100 kB.

Table 2 Simulation parameters


Maximal macro and micro BS transmission powers are set to 20 and 1 W, respectively, while the maximum operational power consumptions of macro and micro BS are 865 W and 38 W, respectively. The COST 231 modified path loss model is used for the radio propagation environment, with macro and micro BS heights set to 32 m and 12.5 m, respectively, as in [4, 12, 29]. In order to guarantee system reliability, a system load threshold ρ_th=0.6 is considered for all BS [29], and the minimum bit rate $\Theta ^{\min }$ is set to 122 Kbps [40] for each active user. The intra-cell interference factor ϕ is set to 0.01, and the exploration parameters for the EEM-UCB and TLEEM-UCB policies are α=0.25 and β=0.32. As per [4], a homogeneous user distribution with intensity Λ=10⁻⁴ corresponds to 10% BS utilization when all BS are switched ON; this value is taken as the reference in the analysis of the influence of traffic load variation on the performance of the proposed policy. Table 2 summarizes all the parameters used for the simulations.

4.1 Convergence analysis

Figure 4 compares the convergence behaviors of the proposed EEM-UCB and TLEEM-UCB algorithms w.r.t. the actor critic (ACT) [45], decentralized greedy [28], and transfer actor critic (TACT) [12] policies. The figure presents, for all policies, the cumulative energy efficiency ratio (CEER), which is defined as

$$\text{CEER}_{\pi} = \frac{\text{EE policy $\pi$}}{\text{EE when all BS are ON}} $$

Moreover, the global optimal solution achieved by an exhaustive search, referred to as the ideal policy, is also shown in Fig. 4. The figure shows the behaviors of the policies in terms of CEER after 3000 iterations for 4 configurations of arrival rates in the source task, i.e., Λ^source, and in the current task, i.e., Λ^target. These curves can be seen as the evolution of the network EE at a given hour of a day with a given arrival rate.
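One possible way to compute a CEER curve from simulation traces is sketched below in Python. The paper does not detail how the EE values are accumulated, so the sketch assumes that "cumulative" means the running sum of per-iteration EE values; the function name and this interpretation are assumptions.

```python
def ceer_curve(ee_policy, ee_all_on):
    """CEER after each iteration: cumulated policy EE over cumulated all-ON EE."""
    if len(ee_policy) != len(ee_all_on):
        raise ValueError("both EE sequences must have the same length")
    curve, sum_policy, sum_all_on = [], 0.0, 0.0
    for ee_p, ee_ref in zip(ee_policy, ee_all_on):
        sum_policy += ee_p
        sum_all_on += ee_ref
        # Assumes strictly positive reference EE, so the denominator is non-zero.
        curve.append(sum_policy / sum_all_on)
    return curve
```

A value above 1 at a given iteration then means that the policy has, so far, been more energy efficient than keeping all BS ON.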

As depicted in Fig. 4, the network utilities of all algorithms tend to increase with time, since their confidence in the best deployment strategy increases as time elapses. However, the performance of all algorithms largely depends on the difference between the source and target task arrival rates. Our policies, EEM-UCB and TLEEM-UCB, converge toward the ideal policy, while the ACT, TACT, and decentralized greedy algorithms only achieve a far suboptimal solution after 3000 iterations. The lower convergence rate of the ACT and TACT algorithms is clear from these results. Our policy TLEEM-UCB generally performs better than all the others, except when the source and target arrival rates are quite different, i.e., Fig. 4d. From Fig. 4a to d, the source arrival rate is fixed to Λ^source=0.05×10⁻⁴ and the target arrival rate varies from Λ^target=0.05×10⁻⁴ to Λ^target=2×10⁻⁴. The transfer learning procedure is the most beneficial when Λ^source=Λ^target, Fig. 4a, since TLEEM-UCB achieves a performance jump-start in the beginning and quickly converges toward the best configuration. On the contrary, in Fig. 4d, when the source and target arrival rates are significantly different, i.e., Λ^source=0.05×10⁻⁴ and Λ^target=2×10⁻⁴, the transferred knowledge impacts the learning in a negative way, and thus, TLEEM-UCB performs worse than EEM-UCB. In that case, it is more beneficial to learn from scratch, since the previously computed indexes Bⁱ ∀i have to be forgotten to learn a better configuration. From these results, we can state that temporal knowledge transfer improves the convergence speed of classical MAB approaches, but it also has a negative effect if the traffic loads in the source and target environments are significantly different. The execution time needed for convergence does not exceed a few minutes on a standard simulation platform using MATLAB, since one iteration basically consists of the computation of the index B^i,h, which does not exceed a few milliseconds.

Figure 5 presents the improvement in CEER of our algorithm when using TL concept compared to non-transferred knowledge, i.e.,

$$\begin{aligned} \text{CEER performance of improvement}\\ = \frac{\text{CEER}_{\mathrm{TLEEM-UCB}} - \text{CEER}_{\mathrm{EEM-UCB}}} {\text{CEER}_{\mathrm{EEM-UCB}}} \end{aligned} $$

One can observe that the TL concept allows a performance jump-start at the early iterations compared to the simple EEM-UCB. The maximum improvement occurs around 500 iterations and is all the larger as the source and target arrival rates are similar. For instance, a gain of about 28% is achieved after 500 iterations when Λ^source=Λ^target=0.05 × 10⁻⁴, but it reduces to only 5% when Λ^target=2×10⁻⁴. For this latter setting, the improvement of TLEEM-UCB w.r.t. EEM-UCB is even negative after 3000 iterations, i.e., −5%, meaning that TL has a negative impact in the long run on the network EE compared to EEM-UCB.
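The relative improvement plotted in Fig. 5 is a plain relative difference, which the short Python sketch below spells out. The function name and the CEER values in the usage lines are hypothetical; they are chosen only so that the outputs have the same order of magnitude as the gains quoted above.

```python
def ceer_improvement(ceer_tl, ceer_no_tl):
    """Relative CEER gain of TLEEM-UCB over EEM-UCB (negative when TL hurts)."""
    if ceer_no_tl <= 0:
        raise ValueError("the reference CEER must be positive")
    return (ceer_tl - ceer_no_tl) / ceer_no_tl


# Hypothetical CEER pairs, not values read from the figures.
print(ceer_improvement(1.28, 1.00))
print(ceer_improvement(0.95, 1.00))
```

A negative result corresponds to the case where transferred knowledge degrades the long-run EE with respect to learning from scratch.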

Finally, Fig. 6 shows how the network energy efficiency decreases when the number of BS increases. In that figure, the percentages of macro and micro BS are 50–50%, and the same settings as in Table 2 are used. Network EE reduces because the optimal configuration is not necessarily reached after 3000 iterations, especially for a high number of BS. Hence, the selected configuration is not the optimal one, and the gap increases as the number of BS increases, as can be inferred from the ideal policy. The larger the number of base stations, the larger the cardinality of the exploration space, making the convergence to the optimal configuration longer. The EE achieved with the ideal policy always increases or remains constant w.r.t. the number of BS. Indeed, the larger the BS density, the larger the data rate, at least up to a limit where the interference generated by co-channel transmissions prevents the spectral efficiency from increasing. Hence, always selecting the best configuration of active BS increases the data rate and hence EE in this configuration. We can also note the gain of the learning policies compared to a scenario where all BS are always ON. It is also worth mentioning that the EE achieved with decentralized greedy eventually outperforms that of TLEEM-UCB when the number of BS is larger than 15 in this configuration, due to a more efficient search of a local optimum when the problem dimension becomes high. However, decentralized greedy requires information to be exchanged between nodes, and hence, the traffic overhead increases as the number of BS increases.

4.2 Performance under periodic traffic load

We also investigate the effectiveness of the proposed learning framework when traffic loads fluctuate periodically. As stated in Section 2, real traffic load follows a periodical pattern that can be approximated by a sinusoidal function as in [29].
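A sinusoidal daily profile of this kind can be sketched in Python as follows. The exact profile used in the simulations is not given in this section, so the parameterization by `mean_rate`, `amplitude`, and `peak_hour`, as well as the function name, are assumptions made for illustration.

```python
import math


def arrival_rate(hour, mean_rate, amplitude, peak_hour):
    """Sinusoidal arrival rate with a 24 h period, maximal at peak_hour."""
    if amplitude > mean_rate:
        raise ValueError("amplitude larger than mean_rate would give negative rates")
    return mean_rate + amplitude * math.cos(2.0 * math.pi * (hour - peak_hour) / 24.0)


# Hypothetical profile spanning 0.05e-4 to 2e-4 with a peak at 17 h.
for hour in (5, 11, 17, 23):
    print(hour, arrival_rate(hour, 1.025e-4, 0.975e-4, 17))
```

The bounds 0.05×10⁻⁴ and 2×10⁻⁴ simply reuse the extreme arrival rates of the convergence study; the peak hour is arbitrary.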

Figure 7 compares the network EE achieved with our policies, i.e., EEM-UCB and TLEEM-UCB, with the previously introduced state-of-the-art algorithms, i.e., ACT, TACT, and decentralized greedy, when the traffic load fluctuates during the day. One can first remark that all policies behave the same at high traffic load, from noon to 22 h, except decentralized greedy, which is inferior. Indeed, at high traffic load, all BS need to be switched ON in order to satisfy the demand, and hence, all learning policies logically converge to the full deployment. Decentralized greedy tends to switch OFF some BS, even at high traffic load, in order to save energy, leading to a loss in EE. On the other hand, in lower traffic periods, i.e., night time, fewer BS need to be switched ON to meet the QoS requirements, and hence, learning strategies make sense to optimize the network EE. In these time slots (1 am to 8 am), TLEEM-UCB achieves significantly higher EE compared to the other algorithms from the literature and reaches 95% of the maximum achievable EE, i.e., that of the ideal policy. Moreover, TLEEM-UCB outperforms its counterpart, i.e., TACT, by about 34%. It also confirms that transfer learning improves the performance compared to the policy without transferred knowledge, i.e., EEM-UCB, by about 23%.

Figure 8 depicts the average percentage of energy savings achieved by the learning algorithms and the ideal policy during 1 day. The energy saving percentage is measured w.r.t. the energy expenditure of a full deployment. As shown in Fig. 8, a large amount of energy saving is achieved by the proposed TLEEM-UCB policy, e.g., about 70% during the low traffic load period (night time). Moreover, the difference between the ideal policy and the TLEEM-UCB policy is less than 5%. On the contrary, the ACT, TACT, and EEM-UCB algorithms achieve only about 60% of energy saving. The decentralized greedy procedure allows the largest energy saving gain, which nearly equals the ideal policy performance in the night time. One can also remark that the latter policy allows 20% of energy saving during the high traffic period by putting more BS into sleep mode, since the energy consumption is privileged. This improvement comes at the cost of user experience and comparatively lower network EE, as observed in Fig. 7.
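The way a saving percentage relates to the set of active BS can be illustrated with the Python sketch below. It is a deliberately crude model: every active BS is assumed to draw its maximum operational power (865 W for a macro BS, 38 W for a micro BS, as in the simulation settings), and a BS that is OFF is assumed to draw nothing. Both assumptions, and the function name, are not from the paper.

```python
MACRO_MAX_W = 865.0  # maximum operational power of a macro BS (Section 4)
MICRO_MAX_W = 38.0   # maximum operational power of a micro BS (Section 4)


def energy_saving(active_macro, active_micro, total_macro=5, total_micro=5):
    """Fraction of energy saved w.r.t. all BS ON, assuming an OFF BS draws 0 W."""
    if not (0 <= active_macro <= total_macro and 0 <= active_micro <= total_micro):
        raise ValueError("active BS counts must lie between 0 and the totals")
    full = total_macro * MACRO_MAX_W + total_micro * MICRO_MAX_W
    used = active_macro * MACRO_MAX_W + active_micro * MICRO_MAX_W
    return 1.0 - used / full


# Same number of active BS, very different savings.
print(energy_saving(active_macro=1, active_micro=4))
print(energy_saving(active_macro=4, active_micro=1))
```

Under this simplification, the saving is dominated by how many macro BS are switched OFF, since one macro BS consumes far more than one micro BS.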

The impact of learning algorithms on the actual deployment of the network is also of great interest for operators. Figures 9 and 10 represent the average number of active BS and the average number of switches that are performed at each time of the day, respectively. Figure 9 gives insight into the average number of BS that need to be switched ON in order to follow the traffic variation along the day. As expected, the average number of BS needed at night time is lower than the one required at the peak period, leading to an increase of network EE and a decrease of energy consumption at night time, as corroborated by Figs. 7 and 8, respectively. During the night time, decentralized greedy, TACT, and ACT are the policies activating the fewest BS, in that order. Our policies come next with an average of 5 BS switched ON, close to the optimal average of around 5.5, allowing higher EE than their counterparts. During the peak load in the afternoon, almost all policies activate the whole set of BS. Decentralized greedy fluctuates around 8.5 BS on average, allowing a larger energy saving gain but a lower EE. It is worth noting that the proposed policies, i.e., EEM-UCB and TLEEM-UCB, activate more micro BS than macro BS to cope with the varying traffic load and to save energy at the same time.

The results presented in Fig. 10 are important because of some practical constraints, e.g., the time needed to turn the power amplifier (PA) ON/OFF and the lifetime of the PA. Indeed, if a learning policy requires switching the PA too often in each time slot, it will significantly reduce the lifetime of the PA and may cause additional power loss due to the initial burst of power consumption when an equipment is switched ON (not taken into account in this work). Our proposed policy, TLEEM-UCB, requires on average about 2 BS mode switches at each time slot in the low load period (night time), which is significantly fewer than the 5 mode switches of the ACT and TACT algorithms in the same period, while a little more than three switches are needed without TL, i.e., with EEM-UCB. All algorithms except decentralized greedy require no BS switches during high traffic periods, whereas decentralized greedy requires between 1 and 2 BS switches all along the day, irrespective of the traffic load.
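One simple way to count mode switches, sketched below in Python, is to compare two consecutive configurations element by element. Representing a configuration as a tuple of 0/1 flags (one per BS) and the function names are assumptions; the paper does not specify how the switch counts of Fig. 10 were obtained.

```python
def count_switches(prev_config, new_config):
    """Number of BS whose ON/OFF state differs between two configurations."""
    if len(prev_config) != len(new_config):
        raise ValueError("configurations must describe the same set of BS")
    return sum(p != n for p, n in zip(prev_config, new_config))


def average_switches(configs):
    """Mean number of mode switches per transition over a sequence of configurations."""
    if len(configs) < 2:
        return 0.0
    total = sum(count_switches(a, b) for a, b in zip(configs, configs[1:]))
    return total / (len(configs) - 1)
```

For example, `count_switches((1, 1, 0), (1, 0, 1))` counts two switches, one BS turned OFF and one turned ON.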

To conclude this part and to shed light on the transfer learning feature, let us assume that the average traffic load variation is very small from 1 day to another at the same time of day, i.e., Λ^source≈Λ^target. This is a reasonable assumption, except between a weekday and a weekend day or between two consecutive days with occasional and exceptional events, as reported in [29]. As mentioned in Section 3.5, if 1 s is taken between two configuration switches, a stable configuration is roughly achieved after 1 h, which may appear relatively long. However, the applicability of TLEEM-UCB has to be considered in the long run, e.g., over 1 week. Indeed, let us consider the particular time range 10:00–11:00 am during the week. On Monday, the algorithm runs for 1 h and saves the configuration achieved at 11:00 am. The next day at 10:00 am, the network just applies the configuration learned the day before and keeps it unchanged over the 10:00–11:00 am range for the rest of the week, without running the algorithm again. This strategy could be applied to each 1 h slot of the day during the first day or the first two days of the week, and the network then just applies the computed configurations at each time slot for the rest of the week. Of course, this strategy does not work if important variations of the average traffic at a given time are observed between two days, as can be inferred from Fig. 4d.
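The reuse strategy described above can be sketched as a small per-slot cache in Python. The class name `HourlyConfigCache` and the callback `learn_config`, which stands in for a full learning run over one 1 h slot, are hypothetical; the sketch only illustrates the bookkeeping, not the learning itself.

```python
class HourlyConfigCache:
    """Learn one BS configuration per 1 h slot and reuse it on later days."""

    def __init__(self, learn_config):
        # learn_config(hour) stands in for a full learning run over that slot.
        self._learn_config = learn_config
        self._configs = {}

    def config_for(self, hour):
        """Return the stored configuration for this slot, learning it if absent."""
        if hour not in self._configs:
            self._configs[hour] = self._learn_config(hour)
        return self._configs[hour]

    def invalidate(self, hour):
        """Forget a slot, e.g. when its average traffic has changed noticeably."""
        self._configs.pop(hour, None)
```

Calling `invalidate` for a slot corresponds to the case where the traffic between two days differs too much for the stored configuration to remain useful, so that learning has to be run again.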

5 Conclusion

In this paper, the problem of BS switching operation for EE maximization in heterogeneous wireless cellular networks has been tackled under the restless MAB framework. A reinforcement learning algorithm originally proposed for the OSA scenario has been adapted to deal with the optimal BS deployment in order to increase the global network EE. Furthermore, in order to increase the convergence rate of our EEM-UCB algorithm, we proposed to use the knowledge acquired in previous time periods, leading to the TLEEM-UCB policy. Our proposed algorithm has been proven to converge to an optimal solution as long as the Markov chains governing the arms obey certain conditions. Extensive numerical analysis has shown the ability of our proposed policy to converge to the optimal deployment, maximizing EE. Transfer learning has been shown to be an effective solution to increase the convergence rate of our UCB algorithm when the source and target arrival rates are not too different. Moreover, our policies have been shown to be able to follow a practical periodic traffic fluctuation. TLEEM-UCB can achieve 95% of the EE achieved by the optimal BS configuration and up to 70% energy saving gain when the traffic load is low (night time). Future work may include other index-based policies, such as Thompson sampling or Bayesian-UCB, that are known for their high performance in terms of regret in other scenarios. Moreover, spatial knowledge transfer between cells may also be of great interest for operators in a dynamic environment.

6 Appendix A: Sketch of proof of Theorem 1

The regret of the TLEEM-UCB policy is governed by the expected number of plays, $\mathbb {E}[T^{i,h}(n)]$, for any suboptimal BS switching operation i. Let l be a positive integer. Recall that $\mu ^{i}=\sum _{S \in \mathcal {S}} S^{i} G^{i}_{S} \pi ^{i}_{S}$. Following the steps in [10] and including the historic time, the number of blocks in which a BS switching operation (action) i has been selected up to block b(n) can be expressed as

$$\begin{array}{@{}rcl@{}} F^{i,h}(b) &=& 1 + \sum_{t=2^{Y}}^{b} \mathbbm{1}\{ \textbf{a}(t)=i\} \\ &\leq& l + \sum_{t=2^{Y}}^{b} \mathbbm{1}\left\{ \textbf{a}(t)=i,\ T_{2}^{i}(t-1) \geq l \right\} \\ &=& l + \sum_{t=2^{Y}}^{b} \mathbbm{1}\left\{ B^{*,h}\left(t-1,T_{2}^{*}(t-1)\right) \leq B^{i,h}\left(t-1,T_{2}^{i}(t-1)\right),\ T_{2}^{i}(t-1) \geq l\right\} \\ &\leq& l + \sum_{t=2^{Y}}^{b} \Big(\mathbbm{1}\left\{ \exists \omega^{i}: l \leq \omega^{i} \leq t-1,\ B^{i,h}(\omega^{i},t) > \mu^{*} \right\} \\ && +\, \mathbbm{1}\left\{ \exists \omega^{*}: 1 \leq \omega^{*} \leq t-1,\ B^{*,h}(\omega^{*},t) \leq \mu^{*} \right\}\Big) \end{array} $$

(14) (15) (16) (17)

where the lower bound of the summation in (14) comes from the fact that each BS configuration is tried at least once, and (15) comes from the fact that each action has been sensed for at least l blocks up to block b. Equation (16) follows from the reason why a suboptimal action i is chosen, i.e., the index of the optimal action at block t−1, $B^{*,h}\left (t-1,T_{2}^{*}(t-1)\right)$, is below the index of the suboptimal action i. Moreover, (16) is upper-bounded by (17) because these two conditions are not exclusive. Taking the expectation on both sides and using the union bound, we get:

$$\begin{array}{@{}rcl@{}} \mathbb{E}\left[F^{i,h}(b)\right] \leq l + \sum_{t=1}^{\infty}\sum_{\omega^{i}=l}^{t-1} \mathbb{P}\left(B^{i,h}(\omega^{i},t) > \mu^{*}\right) \\+ \sum_{t=1}^{\infty} \sum_{\omega^{*}=1}^{t-1} \mathbb{P}\left(B^{*,h}(\omega^{*},t) \leq \mu^{*}\right) \end{array} $$

(18)

The summation over t starts from 1 instead of 2^Y because it does not change the validity of the upper bound. Recall that $G^{i,h}_{S}\left (T^{i}_{2}\right) = \frac {1}{T^{i}_{2}}\sum _{k=1}^{T^{i}_{2}} R^{i}_{S}(k) + \frac {1}{H_{2}^{i}}\sum _{k=1}^{H_{2}^{i}} R^{i,h}_{S}(k)$ denotes the empirical mean of quality observations $R^{i}_{S}$ for action i in state S, $G^{S}_{\max } = \max _{i \in \mathcal {K}} G^{i,h}_{S}(T^{i}_{2})$ is the maximum empirical reward, and G^∗ is the empirical mean of the reward of the optimal action ∗, optimal in terms of both state and EE reward μ^∗. This does not necessarily mean that $G^{*}=G^{S}_{\max }$. Moreover, recall that Δμⁱ=μ^∗−μⁱ. Let us choose $l = \Big \lceil \frac {4 \alpha \ln {\left (n_{2} +H_{2}^{i}\right)}}{(\Delta \mu ^{i})^{2}} \Big \rceil $ and proceed from (18):

$$\begin{array}{@{}rcl@{}} \mathbb{E}[F^{i,h}(b)] &\leq& \Bigg \lceil \frac{4 \alpha \ln {\left(n_{2} +H_{2}^{i}\right)}}{(\Delta \mu^{i})^{2}} \Bigg \rceil \\&+& \sum_{t=1}^{\infty}\sum_{\omega^{i}= \Big \lceil \frac{4 \alpha \ln {\left(n_{2} +H_{2}^{i}\right)}}{(\Delta \mu^{i})^{2}} \Big \rceil}^{t-1} \mathbb{P}\left(B^{i,h}(\omega^{i},t) > \mu^{*}\right) \\ &\,+& \sum_{t=1}^{\infty} \sum_{\omega^{*}= 1}^{t-1} \mathbb{P}\left(B^{*,h}(\omega^{*},t) \leq \mu^{*}\right) \end{array} $$

(19)

We start by bounding the first part of (19), i.e., $ \mathbb {P}\left (B^{i,h}(\omega ^{i},t) > \mu ^{*}\right)$. Substituting B^i,h(ωⁱ,t) by its expression and following the same steps as in [10, Appendix A], but including the number of times action i has been selected in the SB2 block in the historical period, i.e., $H^{i}_{2}$, we end up with

$$\begin{array}{@{}rcl@{}} &\mathbb{P}&\left(B^{i,h}(\omega^{i},t) > \mu^{*}\right) \\&\leq\!& \sum_{S \in \mathcal{S}} N_{\mathbf{h}^{i}} \exp\!\left(\!- \frac{ \left(\omega^{i} + H_{2}^{i}\right) \left(\frac{ \frac{\Delta \mu^{i}}{2} + D^{i,h}(\omega^{i},t)}{r^{i}_{S} |\mathcal{S}| G^{i,h}_{S}\hat{\pi}^{i}_{S}} \right)^{2} \varepsilon^{i} }{28}\right) \end{array} $$

(20)

where $\hat \pi _{S}^{i} = \max \left \{ \pi ^{i}_{S}, 1 - \pi ^{i}_{S}\right \}$, $\hat {\pi }_{\max } = \max _{i \in \mathcal {K}} \hat {\pi }^{i}_{S}$, and $\varepsilon ^{i} = 1 - \lambda _{2}^{i}$ is the eigenvalue gap of action i, defined as the difference between 1 and the second largest eigenvalue of the ith Markov chain. Moreover, (20) follows from [46, Th. 3.3] and from [47, Lem. 2.1] by considering n=ωⁱ, $f\left (S^{i}_{t}\right) = \frac {\mathbb{1}\{S^{i}_{t} = S\} - G^{i,h}_{S}\pi ^{i}_{S}}{G^{i,h}_{S}\hat {\pi }^{i}_{S}}$. The conditions of [46, Th. 3.3] are fulfilled if $G^{i,h}_{S} \geq \frac {1}{ \hat {\pi }_{\max } + \pi ^{i}_{S}}$. Considering an initial distribution hⁱ as defined in [41], $N_{\mathbf {h}^{i}}$ can be upper-bounded by $1/\pi _{\min }$, where $\pi _{\min } = \min _{S\in \mathcal {S}} \pi ^{i}_{S}$. By following the same steps as in [10, Appendix A], we get from (20),

$$ \mathbb{P}\left(B^{i,h}(\omega^{i},t) > \mu^{*}\right) \leq \frac{\left|\mathcal{S}^{i} \right|}{\pi_{\min}} \left(t +H_{2}^{i}\right)^{-\frac{ \Delta \mu^{i} \beta M_{\min} \varepsilon_{\min}}{28 S^{2}_{\max} r^{2}_{\max} G^{2}_{\max}\hat{\pi}^{2}_{\max}}} $$

(21)

where $G_{\max }\equiv G^{S}_{\max }$, $r_{\max } = \max _{S \in \mathcal {S}, i \in \mathcal {K}} r^{i}_{S}$, $M_{\min } = \min _{i \in \mathcal {K}} M^{i,h}\left (\omega ^{i} \right)$, $S_{\max } = \max _{i\in \mathcal {K}} \left |\mathcal {S}^{i}\right |$, and $\varepsilon _{\min } = \min _{i\in \mathcal {K}} \varepsilon ^{i}$. Inserting (21) into the first part of (19), and following the same steps as in [10], we end up with

$$ \sum_{t=1}^{\infty} \sum_{\omega^{i}=l}^{t-1} \mathbb{P}\left(B^{i,h}(\omega^{i},t) > \mu^{*}\right) \leq \frac{|\mathcal{S}^{i}|}{\pi_{\min}} \sum_{t=1}^{\infty} \left(t +H_{2}^{i}\right)^{-2} $$

(22)

where (22) is obtained for $\beta \geq 84 S^{2}_{\max } r^{2}_{\max } G^{2}_{\max }\hat {\pi }^{2}_{\max }/\left (\varepsilon _{\min } \Delta \mu _{i}^{R} M_{\min }\right)$.
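The quantities $\pi _{\min }$, $\hat {\pi }^{i}_{S}$, and the eigenvalue gap entering these bounds have simple closed forms when a chain has two states, which is the case considered at the end of this appendix. The Python sketch below computes them from the two transition probabilities; the function name and the parameterization by `p01` and `p10` are assumptions made for illustration.

```python
def two_state_chain(p01, p10):
    """Stationary distribution and eigenvalue gap of a two-state Markov chain.

    p01 is the probability of moving from state 0 to state 1, p10 the reverse.
    """
    if not (0.0 < p01 < 1.0 and 0.0 < p10 < 1.0):
        raise ValueError("need 0 < p01, p10 < 1 (irreducible and aperiodic chain)")
    pi = (p10 / (p01 + p10), p01 / (p01 + p10))
    lambda_2 = 1.0 - p01 - p10            # second eigenvalue; the first is 1
    return {
        "pi": pi,
        "pi_min": min(pi),
        "pi_hat": tuple(max(x, 1.0 - x) for x in pi),
        "gap": 1.0 - lambda_2,
    }
```

A slowly mixing chain (small `p01 + p10`) has a small gap, which loosens the concentration bound (20) and therefore calls for a larger β.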

Similarly, one can bound the second part of (19) by following the same ideas and applying the same steps as in [10, Appendix A], but introducing $H_{2}^{*}$, i.e., the number of times the best action has been chosen in the SB2 block in the historical period. We get

$$ \mathbb{P}\left(B^{*,h}(\omega^{*},t) \leq \mu^{*}\right) \leq \frac{\left| \mathcal{S}^{*} \right| }{\pi_{\min}} \left(t +H_{2}^{*}\right)^{-\frac{\varepsilon_{\min} \left(\alpha - 2\sqrt{\alpha} \beta M_{\max}\right)}{28 \left(S_{\max} G_{\max} r_{\max} \hat\pi_{\max} \right)^{2} }} $$

(23)

where $M_{\max } = \max _{i \in \mathcal {K}} M^{i,h}\left (\omega ^{i} \right)$. By choosing α such that $\frac {\varepsilon _{\min } \left (\alpha - 2\sqrt {\alpha } \beta M_{\max }\right)}{28 \left (S_{\max } G_{\max } r_{\max } \hat \pi _{\max } \right)^{2}} \geq 3$, we obtain

$$ \mathbb{P}\left(B^{*,h}(\omega^{*},t) \leq \mu^{*}\right) \leq \frac{\left| \mathcal{S}^{*} \right| }{\pi_{\min}} \left(t +H_{2}^{*}\right)^{-3} $$

(24)

Substituting (24) into the second part of (19), we get

$$ \sum_{t=1}^{\infty} \sum_{\omega^{*}=1}^{t-1} \mathbb{P}\left(B^{*,h}(\omega^{*},t) \leq \mu^{*}\right) \leq \frac{|\mathcal{S}^{*}| }{\pi_{\min}} \sum_{t=1}^{\infty} \ \left(t +H_{2}^{*}\right)^{-2} $$

(25)
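The residual series appearing in (22) and (25) shrinks as the number of transferred observations grows. Using the standard identity $\sum _{t=1}^{\infty } (t+H)^{-2} = \pi ^{2}/6 - \sum _{k=1}^{H} k^{-2}$, the Python sketch below evaluates it for an integer H; the function name is hypothetical.

```python
import math


def residual_series(h):
    """Value of sum_{t>=1} (t + h)^(-2) for a non-negative integer h."""
    if h < 0:
        raise ValueError("h must be a non-negative integer")
    return math.pi ** 2 / 6.0 - sum(1.0 / k ** 2 for k in range(1, h + 1))


for h in (0, 10, 100):
    print(h, residual_series(h))
```

Without transfer (H = 0) the series equals π²/6, and it decays roughly like 1/H for large H, so the constant part of the bound becomes small once enough historical samples are available.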

Furthermore, due to the presence of transferred knowledge, we consider $ l^{+} = \max \left (0, \frac {4 \alpha \ln {\left (n_{2} +H_{2}^{i}\right)}}{(\Delta \mu ^{i})^{2}} - H_{2}^{i}\right) $ instead of l and the following bound follows from combining (22) and (25). Then, from (18):

$$ \mathbb{E}[F^{i,h}(b)] \!\leq\! l^{+} + \frac{|\mathcal{S}^{*}|}{\pi_{\min}} \!\sum_{t=1}^{\infty} \left(t +H_{2}^{*}\right)^{-2} + \frac{|\mathcal{S}^{i}|}{\pi_{\min}} \!\sum_{t=1}^{\infty} \!\left(t \,+\,H_{2}^{i}\right)^{-2} $$

(26)

Note that all observations used in calculating the EEM-UCB indices come from the SB2 block. Let the SB2 block begin with observing the regenerative state ζⁱ and end with a return to the same ζⁱ. The total number of times a suboptimal action i is selected up to the end of block b(n) is estimated by considering (i) the total number of plays of suboptimal action i during the SB2 block (upper-bounded by $1/ \pi ^{i}_{\min }$), (ii) the total number of selections in SB1 before entering the SB2 block (upper-bounded by $\Omega ^{i}_{\max }$), and (iii) one more selection resulting from the SB3 block, which is state ζⁱ. Thus, we have

$$\mathbb{E}\left[T^{i,h}(n)\right] \leq \left(\frac{1}{\pi^{i}_{\min}} + \Omega^{i}_{\max} + 1\right) \mathbb{E}\left[F^{i,h}(b(n))\right] $$

Moreover, since $\left |\mathcal {S}^{*}\right | = \left |\mathcal {S}^{i}\right | = 2$ and $S_{\max } = 2$, $r_{\max } = 1$ in our case, Theorem 1 follows.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Notes

Note that other user association metrics could also be used. The optimal user association problems have been well addressed in [36–38]; however, we focus on the BS sleeping scheme rather than user association due to space limitations.

Abbreviations

ACT: Actor critic
BS: Base station
CEER: Cumulative energy efficiency ratio
EE: Energy efficiency
EEM-UCB: Energy efficiency maximization - upper confidence bound
MAB: Multi-armed bandit
MDP: Markov decision process
MS: Mobile station
OSA: Opportunistic spectrum access
PA: Power amplifier
RL: Reinforcement learning
SB: Sub-block
SINR: Signal to interference and noise ratio
TACT: Transfer actor-critic
TLEEM-UCB: Transfer learning energy efficiency maximization - upper confidence bound
QoS: Quality of service
RQoS-UCB: Restless quality of service - upper confidence bound
UCB: Upper confidence bound

References

N. Modi, Machine learning and statistical decision making for green radio (2017). PhD thesis, CentraleSupelec, Rennes.
M. A. Marsan, L. Chiaraviglio, D. Ciullo, M. Meo, in IEEE International Conference on Communications Workshops (ICCW). Optimal energy savings in cellular access networks, (2009), pp. 1–5. https://doi.org/10.1109/iccw.2009.5208045.
G. P. Fettweis, E. Zimmermann, in The 11th International Symposium on Wireless Personal Multimedia Communications (WPMC). ICT energy consumption-trends and challenges, (2009).
K. Son, H. Kim, Y. Yi, B. Krishnamachari, Base station operation and user association mechanisms for energy-delay tradeoffs in green cellular networks. IEEE J. Sel. Areas Commun. 29(8), 1525–1536 (2011).
C. Peng, S. -B. Lee, S. Lu, H. Luo, H. Li, in The 17th Annual International Conference on Mobile Computing and Networking (MobiCom). Traffic-driven power saving in operational 3G cellular networks (ACM, New York, 2011), pp. 121–132.
H. Karl, An overview of energy-efficiency techniques for mobile communication systems. 2003.
E. Oh, B. Krishnamachari, X. Liu, Z. Niu, Toward dynamic energy-efficient operation of cellular network infrastructure. IEEE Commun. Mag. 49(6), 56–61 (2011).
N. Modi, P. Mary, C. Moy, in 2012 Conference Record of the Forty Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR). A sensing policy based on confidence bounds and a restless multi-armed bandit model (San Diego, 2012), pp. 318–323.
C. Robert, C. Moy, C. -X. Wang, in IEEE International Conference on Communications (ICC). Reinforcement learning approaches and evaluation criteria for opportunistic spectrum access (Sydney, 2014), pp. 1508–1513.
N. Modi, P. Mary, C. Moy, QoS driven channel selection algorithm for cognitive radio network: Multi-user multi-armed bandit approach. IEEE Trans. Cogn. Commun. Netw. 3(1), 49–66 (2017).
M. E. Taylor, P. Stone, Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res. 10, 1633–1685 (2009).
R. Li, Z. Zhao, X. Chen, J. Palicot, H. Zhang, TACT: A transfer actor-critic learning framework for energy saving in cellular radio access networks. IEEE Trans. Wirel. Commun. 13(4), 2000–2011 (2014).
Z. Niu, TANGO: Traffic-aware network planning and green operation. IEEE Wirel. Commun. 18(5), 25–29 (2011). https://doi.org/10.1109/MWC.2011.6056689.
L. Chiaraviglio, D. Ciullo, M. Meo, M. Ajmone Marsan, in The 11th International Symposium on Wireless Personal Multimedia Communications (WPMC). Energy-aware UMTS access networks, (2008), pp. 8–11.
Z. Niu, Y. Wu, J. Gong, Z. Yang, Cell zooming for cost-efficient green cellular networks. IEEE Commun. Mag. 48(11), 74–79 (2010). https://doi.org/10.1109/MCOM.2010.5621970.
R. Li, Z. Zhao, Y. Wei, X. Zhou, H. Zhang, in IEEE International Conference on Communications (ICC). GM-PAB: A grid-based energy saving scheme with predicted traffic load guidance for cellular networks, (2012), pp. 1160–1164. https://doi.org/10.1109/ICC.2012.6364637.
J. Gong, S. Zhou, Z. Niu, A dynamic programming approach for base station sleeping in cellular networks. IEICE Trans. Commun. 95, 551–562 (2012). https://doi.org/10.1587/transcom.E95.B.551.
M. A. Marsan, L. Chiaraviglio, D. Ciullo, M. Meo, in IEEE International Conference on Communications Workshops (ICCW). Optimal energy savings in cellular access networks, (2009), pp. 1–5. https://doi.org/10.1109/ICCW.2009.5208045.
M. A. Marsan, M. Meo, Energy efficient management of two cellular access networks. SIGMETRICS Perform. Eval. Rev. 37(4), 69–73 (2010). https://doi.org/10.1145/1773394.1773406.
A. S. Alam, L. S. Dooley, A. S. Poulton, in 2013 IEEE 18th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD). Traffic-and-interference aware base station switching for green cellular networks (Berlin, 2013), pp. 63–67.
E. Oh, B. Krishnamachari, in IEEE Global Telecommunications Conference (GLOBECOM). Energy savings through dynamic base station switching in cellular wireless access networks (Miami, 2010), pp. 1–5.
R. M. Karp, in Complexity of Computer Computations, ed. by R. E. Miller, J. W. Thatcher, and J. D. Bohlinger. Reducibility among combinatorial problems (Springer, Boston, 1972), pp. 85–103.
M. R. Garey, D. S. Johnson, Computers and intractability: A guide to the theory of NP-completeness (W. H. Freeman & Co., New York, 1979).
F. Han, Z. Safar, K. J. R. Liu, Energy-efficient base-station cooperative operation with guaranteed QoS. IEEE Trans. Commun. 61(8), 3505–3517 (2013). https://doi.org/10.1109/TCOMM.2013.061913.120743.
Y. S. Soh, T. Q. S. Quek, M. Kountouris, H. Shin, Energy efficient heterogeneous cellular networks. IEEE J. Sel. Areas Commun. 31(5), 840–850 (2013).
J. Kim, H. W. Lee, S. Chong, in 13th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt). TAES: Traffic-aware energy-saving base station sleeping and clustering in cooperative networks (Mumbai, 2015), pp. 259–266.
V. Konda, V. Borkar, Actor-critic-type learning algorithms for Markov decision processes. SIAM J. Contr. Optim. 38(1), 94–123 (1999).
W. -T. Wong, Y. -J. Yu, A. -C. Pang, in IEEE Global Communications Conference (GLOBECOM). Decentralized energy-efficient base station operation for green cellular networks (Anaheim, 2012), pp. 5194–5200.
E. Oh, K. Son, B. Krishnamachari, Dynamic base station switching-on/off strategies for green cellular networks. IEEE Trans. Wirel. Commun. 12(5), 2126–2136 (2013).
S. Zhou, J. Gong, Z. Yang, Z. Niu, P. Yang, Green mobile access network with dynamic base station energy saving. ACM MobiCom. 9(262), 10–12 (2009).
W. Guo, T. O’Farrell, Dynamic cell expansion with self-organizing cooperation. IEEE J. Sel. Areas Commun. 31(5), 851–860 (2013). https://doi.org/10.1109/JSAC.2013.130504.
C. Tekin, M. Liu, Online learning of rested and restless bandits. IEEE Trans. Inf. Theory. 58(8), 5588–5611 (2012). https://doi.org/10.1109/TIT.2012.2198613.
J. Oksanen, V. Koivunen, H. V. Poor, A sensing policy based on confidence bounds and a restless multi-armed bandit model. CoRR abs/1211.4384 (2012).
J. Oksanen, V. Koivunen, An order optimal policy for exploiting idle spectrum in cognitive radio networks. IEEE Trans. Signal Process. 63(5), 1214–1227 (2015). https://doi.org/10.1109/TSP.2015.2391072.
W. Zhang, in Proceedings of the 19th International Teletraffic Congress. Performance of real-time and data traffic in heterogeneous overlay wireless networks, (2005).
M. F. Hossain, K. S. Munasinghe, A. Jamalipour, Distributed inter-BS cooperation aided energy efficient load balancing for cellular networks. IEEE Trans. Wirel. Commun. 12(11), 5929–5939 (2013).
H. Kim, G. de Veciana, X. Yang, M. Venkatachalam, Distributed α-optimal user association and cell load balancing in wireless networks. IEEE/ACM Trans. Netw. 20(1), 177–190 (2012).
K. Son, S. Chong, G. D. Veciana, Dynamic association for load balancing and interference avoidance in multi-cell networks. IEEE Trans. Wirel. Commun. 8(7), 3566–3576 (2009).
A. J. Fehske, F. Richter, G. P. Fettweis, in IEEE Globecom Workshops. Energy efficiency improvements through micro sites in cellular mobile radio networks (Honolulu, 2009), pp. 1–5.
A. Alam, L. Dooley, in IEEE Wireless Communications and Networking Conference. A scalable multimode base station switching model for green cellular networks (New Orleans, 2015).
C. Tekin, M. Liu, in The 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton). Online algorithms for the multi-armed bandit problem with Markovian rewards (IEEE, Allerton, 2010), pp. 1675–1682.
C. Tekin, M. Liu, in IEEE INFOCOM. Online learning in opportunistic spectrum access: A restless bandit approach (Shanghai, 2011), pp. 2462–2470.
C. -C. Wang, S. R. Kulkarni, H. V. Poor, Bandit problems with side observations. IEEE Trans. Autom. Control. 50(3), 338–355 (2005). https://doi.org/10.1109/TAC.2005.844079.
N. Cesa-Bianchi, G. Lugosi, Prediction, learning and games (Cambridge University Press, New York, 2006).
R. Li, Z. Zhao, X. Chen, H. Zhang, in IEEE Global Communications Conference (GLOBECOM). Energy saving through a learning framework in greener cellular radio access networks (Anaheim, 2012), pp. 1556–1561.
P. Lezaud, Chernoff-type bound for finite Markov chains. Ann. Appl. Probab. 8, 849–867 (1998).
V. Anantharam, P. Varaiya, J. Walrand, Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part II: Markovian rewards. IEEE Trans. Autom. Control. 32(11), 977–982 (1987).

Acknowledgements

Not applicable.

Funding

This work received French government support granted to the CominLabs excellence laboratory and managed by the National Research Agency in the “Investing for the Future” program under reference no. ANR-10-LABX-07-01. The authors would also like to thank the Region Bretagne, France, for its support of this work.

Author information

Navikkumar Modi, Philippe Mary and Christophe Moy contributed equally to this work.

Authors and Affiliations

Brussels Airport Company, Zaventem, BE-1930, Belgium
Navikkumar Modi
Univ. Rennes, INSA de Rennes, CNRS, IETR - UMR 6164, Rennes, F-35000, France
Philippe Mary
Univ. Rennes, CNRS, IETR - UMR 6164, Rennes, F-35000, France
Christophe Moy

Contributions

NM has provided the scientific and technical contents of the paper, deriving analytical results and numerical simulations. The three authors have contributed to the writing of the paper. All authors read and approved the final manuscript.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work performed during the PhD thesis of Navikkumar Modi at CentraleSupelec, France [1].

Navikkumar Modi, Philippe Mary and Christophe Moy are equal contributors.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Modi, N., Mary, P. & Moy, C. Transfer restless multi-armed bandit policy for energy-efficient heterogeneous cellular network. EURASIP J. Adv. Signal Process. 2019, 46 (2019). https://doi.org/10.1186/s13634-019-0637-1

Received: 26 July 2018
Accepted: 23 August 2019
Published: 21 October 2019
DOI: https://doi.org/10.1186/s13634-019-0637-1
