 Research
 Open access
A deep reinforcement approach for computation offloading in MEC dynamic networks
EURASIP Journal on Advances in Signal Processing, volume 2024, Article number: 48 (2024)
Abstract
In this study, we investigate the challenges associated with dynamic time slot server selection in mobile edge computing (MEC) systems. This study considers the fluctuating nature of user access at edge servers and the various factors that influence server workload, including offloading policies, offloading ratios, users' transmission power, and the servers' reserved capacity. To streamline the process of selecting edge servers with an eye on long-term optimization, we cast the problem as a Markov Decision Process (MDP) and propose a Deep Reinforcement Learning (DRL)-based algorithm as a solution. Our approach involves learning the selection strategy by analyzing the performance of server selections in previous iterations. Simulation outcomes show that our DRL-based algorithm surpasses benchmarks, delivering minimal average latency.
1 Introduction
Amidst the development of fifth-generation (5G) networks and the rising popularity of mobile devices, applications such as artificial intelligence, big data processing, augmented reality (AR), and natural language processing are generating an unprecedented volume of data streams [1,2,3]. These applications typically require substantial computational resources. Yet mobile devices often have limited computing capabilities due to hardware constraints. Despite the advanced performance of contemporary devices, they fall short of addressing the needs of computation-heavy applications and the demand for low-latency, high-reliability communication.
To tackle the issue of inadequate computing resources on mobile devices, several solutions have emerged, with Mobile Cloud Computing (MCC) [4,5,6,7] being among the first. Cloud servers boast vast computational resources, which can effectively compensate for the limited capacity of mobile devices to handle intensive computational tasks. However, the use of MCC comes with its own set of challenges. For instance, the network topology of MCC means that cloud servers are typically situated at a considerable distance from mobile devices, necessitating a reliance on the network for cloud connectivity. This can result in reduced efficiency and a degraded user experience due to factors such as mobile network bandwidth and latency. Furthermore, MCC poses certain security risks; data and applications on mobile devices often contain sensitive information, creating a potential for data breaches during transmission to the cloud.
To mitigate these issues, Mobile Edge Computing (MEC) has gained traction [8,9,10]. The fundamental concept of MEC is to decentralize computational resources from the core network to the network's edge by outfitting edge devices such as base stations (BSs) with high-speed computational servers. This shift gives users of computationally intensive applications access to computational support in closer proximity. It is projected that, in the future, up to 75% of data generated by enterprises is likely to undergo processing at the edge of the network [11].
In MEC networks, application providers capitalize on the mobility of mobile devices to gather users' preferences and location data by tracking the devices' trajectories. This enables the providers to dynamically select the most suitable MEC server for each user, thereby decreasing both task processing latency and energy consumption. For instance, literature [12] introduces services that accommodate the random movement and task arrivals of multiple mobile terrestrial users by integrating unmanned aerial vehicles (UAVs) into the MEC framework. In literature [13], the system's profitability is optimized by strategically managing the pricing of MEC computation services, the amount of data offloaded, and the selection of MEC servers while acknowledging the dynamic and unpredictable nature of user behavior. The study in [14] demonstrates how a UAV-assisted MEC system can enhance overall stability and reduce both energy usage and computational delay by managing the UAVs' flight paths and fine-tuning the offloading ratio.
However, it is insufficient to focus only on user mobility; one must also account for dynamic changes in the number of users accessing edge services. Most existing network architectures inaccurately assume a constant number of user accesses within their mathematical models [15,16,17,18,19]. In multi-user MEC systems, the complexity increases due to intricate resource competition and potential interactions among mobile terminals, on top of the natural limitations imposed by finite computational and communication resources. The arrival of tasks is unpredictable, occurring at various times and involving variable sizes depending on the application. This unpredictability, combined with the variety in task sizes, makes accurate forecasting a challenge. Furthermore, the dynamic nature of users' tasks, influenced by emergency procedures, mobility, and the uncertain operational state of the mobile terminal (including random tasks and transmission state), poses substantial challenges for effectively offloading applications to fully benefit from MEC. Emergency procedures are primarily influenced by the type of device, the user's sense of urgency or importance, and the critical nature of the computational task in emergency situations (e.g., myocardial infarction, urgent applications, etc.). Users in such critical states should be given utmost priority, with their tasks being processed immediately. Hence, it is a significant challenge to incorporate the impact of user mobility, the stochastic nature of task arrivals, and the urgency of tasks into the modeling process.
Within edge computing networks, the offloading of tasks is categorized into two approaches, partial offloading and binary offloading, depending on whether a computational task is divisible. In the binary offloading model, an indivisible task must be processed entirely either locally or on the MEC server [20,21,22,23,24,25]. With partial offloading, it is possible to offload portions of a task to the edge server by determining an optimal offloading ratio [26, 27]. Given the varying requirements of tasks, this study adopts the partial offloading approach. Because wireless channels vary, offloading all computational tasks may not always be beneficial; conversely, opportunistic offloading that adapts to the fluctuating channel conditions can yield substantial performance improvements. The resulting challenge is that variations in back-end application demands can alter the computational capacity available on edge servers. Therefore, developing a logical and efficient computational offloading strategy remains a formidable challenge.
Unlike cloud computing systems, edge servers typically possess constrained resources. Consequently, selecting the right edge server is a crucial component of the computation offloading process [28]. The research presented in literature [29] focuses on UAV-assisted mobile edge computing with the objective of minimizing system latency by simultaneously optimizing UAV flight paths, time slot allocation, and compute resource distribution. Xing et al. introduced a computational offloading strategy in [30], aimed at reducing the user's offloading latency through the combined optimization of offloading duration and task processing time. However, given that the coverage of edge servers is finite and users may move frequently, mobile users may transition across various edge server coverage zones. Inappropriate server selection can lead to increased latency and energy consumption, thereby degrading the user experience. The population of users accessing edge servers is constantly fluctuating. As a user offloads tasks, it escalates the workload of the corresponding edge server, impacting the computational costs for all other users connected to that server. This interdependence among users' choices complicates server selection further. The ongoing nature of user services, coupled with the dynamics introduced by user mobility, adds to the complexity of server selection. Thus, it becomes essential to estimate the long-term optimality of computation and communication expenses over a sequence of time slots within a dynamic setting.
Additionally, most existing research on time-variant challenges in MEC systems relies on conditions such as channel statistics to precisely monitor and update network-wide channel information, which incurs significant signaling overheads, as in the time-slotting strategies discussed in literature [31,32,33,34,35]. However, in edge environments, minimizing delay is imperative. The primary difficulty in multi-user edge systems is the allocation of limited communication and computational resources. Each endpoint generates tasks unpredictably, which, if not managed promptly, may cause network bottlenecks and queuing at the edge servers, ultimately diminishing system performance. To ensure timely task dispatch and execution among devices, our goal is to determine how to measure task state updates within the available time slots for mobile users with random movement and data arrival patterns, so as to deliver computational services effectively.
In edge computing frameworks, system state transitions are primarily induced by elements such as the randomness of user engagement with the system, server workload fluctuations, and unpredictable task generation. These variables are not known beforehand, making it arduous to identify the optimal policy using conventional methods [36,37,38,39]. To navigate these challenges, we employ deep reinforcement learning (DRL) techniques. DRL is capable of proposing the most appropriate action by processing vast amounts of high-dimensional raw data as input to the deep neural network, leveraging the deep neural network's robust approximation capabilities. It is not necessary to foresee state transitions in advance, as DRL excels in managing control in stochastic and dynamic environments. Instead, it directly assesses mobile user dynamics based on observed outcomes, in accordance with the current system state, to facilitate server selection.
Based on the observations outlined previously, this study concentrates on the collaborative optimization of offloading decisions and resource allocation for task execution in MEC with the objective of minimizing the latency across the entire MEC system. The key contributions of this study are summarized as follows:

A mixed-integer nonlinear programming model is presented to optimize task offloading and resource allocation decisions. We propose a time slot optimization scheme that accounts for a time-varying MEC system, characterized by dynamic and real-time changes. Mobile users initiate tasks with a certain probability that follows a uniform distribution. These tasks are unsynchronized, vary in size, and are generated by a constantly changing number of users. This study takes into account the stochastic nature of application requests from mobile users, as well as the unpredictable states of mobile terminals (MTs), including operational states and mobility patterns.

We facilitate dynamic optimization of joint resource allocation and task offloading decisions. Unlike most existing studies, which are static and do not update resource allocation synchronously with the offloading decision, this study considers the offloading strategy for computational tasks, varying user priorities, and the resource demand of users with uncertain transmission power. We formulate the corresponding problem as a mixed-integer nonlinear optimization challenge to simultaneously optimize the offloading decisions of mobile users and their access to the network, aiming to minimize the long-term latency of the whole MEC system. We model the user's server selection decision as a Markov decision process, considering both short-term resource optimization within a time slot and long-term optimization across time slots. To address this, we propose a Deep Deterministic Policy Gradient (DDPG)-based algorithm, which is designed to adapt to dynamically changing user conditions.

We conduct experimental simulations to assess the performance of our proposed algorithm and benchmark its effectiveness against existing algorithms.
In the following sections, we will delve into the system model, the proposed DRLbased server selection algorithm, and the results of our experimental simulations in detail.
2 Problem statement and formulation
2.1 System model
We consider a MEC system consisting of edge servers and mobile users, containing N users and K MEC servers, as shown in Fig. 1. The set of users is denoted as \(\mathcal{N} = \left\{ {1,2,\ldots ,N} \right\}\). To provide computing services to the users, the MEC servers are deployed at the access points (APs). Furthermore, the time model is discrete: time is divided into slots indexed by t, where \(t \in \left\{ {0,1,2,\ldots ,\tau } \right\}\). In each time slot, each user generates at most one computationally intensive task. In the system model, random arrival of tasks and real-time dynamic processing are used. The allocation of system spectrum and computational resources is uniformly scheduled by the MEC server.
Without loss of generality, we characterize the computational tasks arriving at user i (\(i \in \mathcal{N}\)) in time slot t. \({N_{\max }}\) is the maximum number of users that the edge system can accommodate, with the number of users accessing the servers varying in each time slot; \({A_i}(t)\) denotes the task of the ith user at moment t and obeys a uniform distribution. The parameter tuple \({U_i}\left( t \right) = \{ {S_i}(t),{D_i}(t),{P_{i,\max }}(t),{\theta _{i,\max }}(t),{\lambda _i}(t)\}\) represents the characteristics of user i, where \({S_i}\left( t \right)\) represents the data size of the computational task, and \({D_i}\left( t \right)\) reflects the resources required to accomplish the task, i.e., the total number of CPU cycles required. \({P_{i,\max }}(t)\) is the maximum transmit power of the user. \({\theta _{i,\max }}(t)\) denotes the maximum tolerable delay of the task. \({\lambda _i}\) is the priority of user i, which is computed from the type of the device and the degree of urgency/importance of the user, with a larger \({\lambda _i}\) denoting a more urgent matter (priorities can be thresholded into classes, with values above the threshold treated as urgent and smaller values sorted in weighted order).
The features of edge server k are denoted by the parameter tuple \(\left\{ {{C_k}\left( t \right) ,C_k^r\left( t \right) } \right\}\), where \({C_k}\left( t \right)\) represents the server's processing capability, which is constant as a basic parameter of the edge server. However, at moment t, the computational capability that the edge server can actually provide to the user is variable and is denoted by \(C_k^r\left( t \right)\).
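To make the model concrete, the per-slot user task tuple \(U_i(t)\) and the edge-server features can be sketched as simple data structures. This is an illustrative encoding only; the field and class names are assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class UserTask:
    """Per-slot task characteristics U_i(t) from the system model.
    Field names are illustrative."""
    size_bits: float   # S_i(t): data size of the computational task
    cycles: float      # D_i(t): total CPU cycles required
    p_max: float       # P_{i,max}(t): maximum transmit power
    deadline: float    # theta_{i,max}(t): maximum tolerable delay
    priority: float    # lambda_i(t): user priority (larger = more urgent)

@dataclass
class EdgeServer:
    """Edge-server features {C_k(t), C_k^r(t)}."""
    capacity: float    # C_k: fixed processing capability (cycles/s)
    used: float = 0.0  # C_k^used(t): capability already allocated

    @property
    def available(self) -> float:
        # C_k^r(t): what the server can still offer at slot t
        return self.capacity - self.used

srv = EdgeServer(capacity=10e9, used=4e9)
print(srv.available)  # 6e9 cycles/s remain for new tasks
```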
In the traditional time slot scheme, the tasks generated by users in a given time slot must wait until all users' tasks have been processed before the resources are released together to process the tasks of the next time slot; new tasks generated during this waiting period therefore remain in a waiting state. This greatly degrades the user experience. To reduce the latency, a novel scheme is proposed for the time slot system, shown in Fig. 2. At moment t1, user 1 and user 2 generate new tasks Task 1 and Task 2, respectively. On that basis, the proposed system is able to dynamically adjust mobile users' server access in real time. When the task of the previous time slot ends and a new task arrives (generated by user 3), the server immediately releases the resources (used in the previous time slot to process user 1) to process the newly arrived task, thus satisfying the low-latency requirements of edge system users.
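The difference between the two slot schemes can be illustrated with a minimal event-driven sketch: resources are released per task rather than per slot, so a newly arrived task starts as soon as any capacity frees up. The task list, capacity parameter, and function name are illustrative assumptions.

```python
import heapq

def simulate(tasks, capacity):
    """Event-driven resource release: a new task starts as soon as
    capacity frees up, instead of waiting for the whole slot to drain.

    tasks: list of (arrival_time, service_time) pairs.
    capacity: number of tasks the server can run in parallel.
    Returns each task's completion time (illustrative model only).
    """
    busy = []    # min-heap of finish times of currently running tasks
    finish = []
    for arrival, service in sorted(tasks):
        # release every resource that freed up before this arrival
        while busy and busy[0] <= arrival:
            heapq.heappop(busy)
        # start immediately if capacity is free, else wait for the
        # earliest-finishing task to release its resources
        start = arrival if len(busy) < capacity else heapq.heappop(busy)
        end = max(start, arrival) + service
        heapq.heappush(busy, end)
        finish.append(end)
    return finish

# The third task arrives at t=2 and starts immediately once the
# resource used by the first task is released at t=2.
print(simulate([(0, 2), (0, 3), (2, 1)], capacity=2))  # [2, 3, 3]
```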
2.2 Communications model
In the edge server system model, since edge servers are densely deployed, the coverage areas of various edge servers often overlap with each other, so a mobile user can be covered by multiple edge servers at the same time. When user i handles its computing tasks locally, the processing time is determined by the computing capability of the user, which differs from user to user.
In practical environments, the task characteristics and computational capabilities of an edge server may be time-varying due to the changing environment. To compute task \({A_i}(\mathrm{{t}})\), user i offloads a portion \({\rho _{i,k}}\) of the task \({A_i}(\mathrm{{t}})\) to the edge server k over a wireless link, where \(0 \le {\rho _{i,k}} \le 1\). Each edge server k returns the computation results to the user over a dedicated feedback link. The offloading vector of user i is expressed as \({\rho _i} = [{\rho _{i,0}},{\rho _{i,1}},\ldots ,{\rho _{i,k}}]\), where \({\rho _{i,0}}\) is the locally computed proportion and \({\rho _{i,k}}\) is the proportion of tasks offloaded by user i to edge server k.
The user's transmit power affects the transmission data rate; on that basis, the user's transmit power should also be optimized. \({p_i} = [{p_{i,0}},{p_{i,1}},\ldots ,{p_{i,k}}]\) denotes the vector of the user's transmit power.
In this study, we focus on minimizing the delay of the communication and computation process to measure the system cost within each time slot. The uplink transmission rate from user i to the edge server k on the wireless link can be expressed as
$$\begin{aligned} {R_{i,k}}\left( t \right) = B{\log _2}\left( {1 + \frac{{{p_i}\left( t \right) h_i^k\left( t \right) }}{{{\sigma ^2} + {I_i}\left( t \right) }}} \right) , \end{aligned}$$
where B is the bandwidth of the edge server channel, \({\sigma ^2}\) is the noise power, \({I_\mathrm{{i}}}(t)\) is the interference caused to the user i by other users in the channel, \(h_i^k(t)\) is the channel gain between the mobile user i and the edge server k at time slot t, and \({p_i}(t)\) is the uplink power of the user.
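The Shannon-capacity rate described above can be computed directly; the function below is a minimal sketch using the symbols just defined (bandwidth B, transmit power, channel gain, noise power, and interference), with illustrative parameter values.

```python
import math

def uplink_rate(bandwidth, power, gain, noise_power, interference):
    """Uplink rate R_{i,k}(t) = B * log2(1 + SINR), where
    SINR = p_i(t) * h_i^k(t) / (sigma^2 + I_i(t))."""
    sinr = power * gain / (noise_power + interference)
    return bandwidth * math.log2(1.0 + sinr)

# 1 MHz channel with SINR = 3 gives log2(4) = 2 bit/s/Hz
print(uplink_rate(1e6, power=3.0, gain=1.0,
                  noise_power=0.5, interference=0.5))  # 2e6 bit/s
```

Note that interference from other users enters the denominator alongside the noise power, so a crowded channel directly lowers the achievable offloading rate.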
2.3 Calculation model
When tasks are offloaded from the user to the MEC, the complete task execution latency covers the communication latency between the user and the MEC as well as the computation latency at the MEC servers. Since each MEC server always handles other computational tasks simultaneously, the background workload may overload the MEC servers. Drawing upon the help of multiple MEC servers, users can select the associated MEC servers to minimize the computation delay. Thus, MEC server selection serves as a new dimension that reduces task execution latency and the user's energy consumption. The computational tasks can be executed locally by the user or by computational offloading in a certain ratio \({\rho _{i,k}}\) on the MEC servers, and the latency is given as follows, respectively.
2.3.1 Locally computed delay
When user device i processes its computational tasks locally, the processing time is determined by its own computing capability, which varies from user to user. The computational capability of user i is \(f_i^{\mathrm{{loc}}}(t)\), and the computational capability of the edge server is expressed as \(f_k^{mec}(t)\). In each time slot, tasks are generated randomly, following a uniform distribution \({A_i}\left( t \right) \sim U\left( \cdot \right)\). The user's local computation time then satisfies
$$\begin{aligned} T_i^{loc}\left( t \right) = \frac{{\left( {1 - {\rho _{i,k}}} \right) {D_i}\left( t \right) }}{{f_i^{\mathrm{{loc}}}\left( t \right) }}, \end{aligned}$$
where \({D_i}\left( t \right)\) denotes the resources required to accomplish the computational task whose data size is \({S_i}\left( t \right)\).
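The local-delay computation is a one-liner; the sketch below assumes, as in the partial-offloading model, that the fraction \(1 - \rho_{i,k}\) of the task's CPU cycles is kept on the device.

```python
def local_delay(rho, cycles, f_local):
    """T_i^loc(t): time to process the locally retained fraction
    (1 - rho) of the task's D_i(t) CPU cycles at the device's own
    speed f_i^loc(t), in cycles per second."""
    return (1.0 - rho) * cycles / f_local

# 40% of a 1-Gcycle task on a 2-GHz device takes 0.2 s
print(local_delay(rho=0.6, cycles=1e9, f_local=2e9))  # 0.2
```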
2.3.2 Calculate offloading delay
When user i offloads the computation task to the MEC server, the delay mainly consists of the uplink transmission time, the MEC server task execution time, and the time for the output result to be transmitted from the MEC back to the user (which is negligible). The uplink transmission time is
$$\begin{aligned} T_i^{tra}\left( t \right) = \frac{{{\rho _{i,k}}{S_i}\left( t \right) }}{{{R_{i,k}}\left( t \right) }}, \end{aligned}$$
where \({R_{i,k}}\left( t \right)\) is the uplink transmission rate.
The total delay for the MEC to process the offloading task of user i is
$$\begin{aligned} T_i^{mec}\left( t \right) = \frac{{{\rho _{i,k}}{S_i}\left( t \right) }}{{{R_{i,k}}\left( t \right) }} + \frac{{{\rho _{i,k}}{D_i}\left( t \right) }}{{{\xi _{i,k}}\left( t \right) }}, \end{aligned}$$
where \({\xi _{i,k}}\left( t \right)\) represents the computing power allocated by the edge server k for the user i thereof.
For the entire edge system, the total delay for all users can be expressed as
$$\begin{aligned} T\left( t \right) = \sum \limits _{i = 1}^{N} {\max \left\{ {T_i^{loc}\left( t \right) ,T_i^{mec}\left( t \right) } \right\} }. \end{aligned}$$
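The per-user delay composition described above (uplink transmission of the offloaded fraction plus edge execution, with the result feedback neglected) can be sketched as follows. Treating the local and offloaded branches as running in parallel, so that the user's delay is the slower of the two, is a common modeling assumption; the function names are illustrative.

```python
def offload_delay(rho, size_bits, cycles, rate, f_mec):
    """T_i^mec(t): uplink transmission of the offloaded fraction
    plus execution on the edge server at allocated speed xi_{i,k}."""
    return rho * size_bits / rate + rho * cycles / f_mec

def user_delay(rho, size_bits, cycles, rate, f_local, f_mec):
    """Local and offloaded parts proceed in parallel, so the user's
    completion time is the max of the two branches (an assumption)."""
    local = (1.0 - rho) * cycles / f_local
    remote = offload_delay(rho, size_bits, cycles, rate, f_mec)
    return max(local, remote)

# Here the local branch (0.5 s) dominates the offloaded one (0.15 s),
# suggesting a larger offloading ratio would lower total delay.
print(user_delay(rho=0.5, size_bits=1e6, cycles=1e9,
                 rate=1e7, f_local=1e9, f_mec=5e9))  # 0.5
```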
2.4 Problem formulation
To simultaneously safeguard the task processing latency and computational cost for the edge server collaborative computing system, the objective function is
$$\begin{aligned} F\left( t \right) = \sum \limits _{i = 1}^{N} {{\lambda _i}\,f\left( {T_i^{mec,loc}\left( t \right) ,{\theta _{i,\max }}\left( t \right) } \right) }, \end{aligned}$$
where \({\lambda _i}\) denotes the priority of the user accessing the server (the larger the value of \(\lambda\), the higher the priority of the user) and \(f\left( {T_i^{mec,loc}\left( t \right) ,{\theta _{i,\max }}\left( t \right) } \right)\) is the reward function. Our proposed optimization problem can be completely expressed as
$$\begin{aligned} \min _{a,\rho ,p,\xi }\ {}&\sum \limits _{t = 0}^{\tau } {F\left( t \right) } \\ \mathrm {s.t.}\ {}&\mathrm {C}1:\ {a_{i,k}}\left( t \right) \in \left\{ {0,1} \right\} ,\ \sum \limits _{k = 1}^{K} {{a_{i,k}}\left( t \right) } \le 1,\ \forall i,\\ {}&\mathrm {C}2:\ 0 \le {\rho _{i,k}}\left( t \right) \le 1,\ \forall i,k,\\ {}&\mathrm {C}3:\ 0 \le {p_i}\left( t \right) \le {P_{i,\max }}\left( t \right) ,\ \forall i,\\ {}&\mathrm {C}4:\ \sum \limits _{i = 1}^{N} {{a_{i,k}}\left( t \right) {\xi _{i,k}}\left( t \right) } \le C_k^r\left( t \right) ,\ \forall k, \end{aligned}$$
where \(\mathrm{{C}}1\) denotes the user's task offloading server selection, assuming that the user's task can only select one server. \(\mathrm{{C}}2\) constrains the offloading vector of user i. \(\mathrm{{C}}3\) ensures the constraint on uplink power, and \(\mathrm{{C}}4\) bounds the computational resource allocation strategy. The optimization problem proposed in this study is a mixed-integer nonlinear programming challenge that is non-convex and NP-hard. To address this problem, we need to determine the offloading decision vector for each time slot, which encompasses the choice of server for offloading, the offloading ratio, and the user's transmission power, all aimed at minimizing the total delay cost of the system while adhering to a specified delay constraint.
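As a sanity check, the four constraint families C1–C4 described above can be verified programmatically for one user's decision. The variable layout and function name below are illustrative assumptions, not the paper's implementation.

```python
def feasible(x_row, rho, p, p_max, xi_col, c_avail):
    """Check one decision against the constraints in the text.

    C1: binary server selection, at most one server chosen
    C2: offloading ratio within [0, 1]
    C3: transmit power within [0, P_max]
    C4: capability allocated on a server within its available C_k^r(t)
    """
    c1 = all(v in (0, 1) for v in x_row) and sum(x_row) <= 1
    c2 = 0.0 <= rho <= 1.0
    c3 = 0.0 <= p <= p_max
    c4 = sum(xi_col) <= c_avail
    return c1 and c2 and c3 and c4

# One server selected, valid ratio and power, allocation fits capacity
print(feasible([0, 1, 0], rho=0.4, p=0.1, p_max=0.2,
               xi_col=[1e9, 2e9], c_avail=4e9))  # True
```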
It is important to note that the offloading decision variables \({a_{i,k}}\), \({\rho _{i,k}}\) and \({p_i}\) are dynamic. The system must gather information to ensure that offloading strategies and resource allocation decisions are informed by an overarching awareness of the network state. Furthermore, we explore a more realistic scenario where the pattern of task requests over time is not known in advance. Given the dynamic nature of the problem at hand, conventional optimization methods fail to deliver swift decisions in a constantly changing state, and the complexity of the algorithms scales up exponentially with the system model's expansion. Hence, we propose a DRL-based method to tackle the problem presented in this study.
3 Approach design
In this section, we conceptualize the challenge of minimizing service delay as a Markov decision process (MDP). Initially, we define the state, action, and reward functions within the MDP framework. Subsequently, we employ the DDPG algorithm to resolve the problem.
3.1 Markov decision process model
For each discrete time slot t, the agent ascertains the presence of a new user and the generation of a new task. Upon the creation of a new task, the agent collects environmental information such as the allocatable computing capacity of the MEC node, the data being transmitted by users currently accessing the service, and the power and state of the environment. The agent then selects an action following the relevant strategy, interacts with the environment to acquire an updated state, and receives a reward signal generated by the environment. The agent iteratively refines its strategy in response to the reward, accumulating rewards after each action until the strategy stabilizes. Given that the agent must consider both immediate and future rewards, the principal learning objective is to maximize cumulative rewards through the continuous refinement of its strategy. In our model, one of the users is designated as the intelligent agent, while all other components of the edge computing system constitute the environment. Below, we provide a detailed account of the state space, action space, and reward function.
3.1.1 State space
The state in MDP is a space reflecting the environment, encompassing the user state and the edge computing server state. The state space is represented by \(Z\left( t \right) = \left\{ {U\left( t \right) ,C\left( t \right) ,I\left( t \right) } \right\}\), where the user state is \(U\left( t \right) = \left\{ {{U_1}\left( t \right) ,{U_2}\left( t \right) , \cdots ,{U_{{N_{\max }}}}\left( t \right) } \right\}\), and \({N_{\max }}\) represents the maximum number of users that the edge system can accommodate.
User state: The state characteristics of the ith user can be expressed as \({U_i}\left( t \right) = \{ {S_i}\left( t \right) ,{D_i}\left( t \right) ,{P_{i,\max }},{\theta _i}\left( t \right) ,{\lambda _i}\left( t \right) \}\), where \(0 < i \le {N_{\max }}\), \({S_i}\left( t \right)\) represents the data size of the computational task, \({D_i}\left( t \right)\) indicates the resources required to complete the task, and \({P_{i,\max }}\) is the maximum transmit power of the user. \({\theta _i}(t)\) is the delay requirement, i.e., the maximum tolerable time of the task. \({\lambda _i}\) is the priority of user i, which is determined by the type of the device and the degree of urgency/importance of the user; the larger \({\lambda _i}\), the greater the urgency of the matter. For instance, when user i has no access, or has access but generates no new task, then \({S_i}\left( t \right) = 0\), \({D_i}\left( t \right) = 0\), \({P_{i,\max }} = 0\), \({\theta _{i,\max }}(t) = 0\), and \({\lambda _i}\left( t \right) = 0\).
The state characteristics of an edge computing server can be represented as \(C\left( t \right) = \left\{ {C_1^r\left( t \right) ,C_2^r\left( t \right) ,\ldots ,C_K^r\left( t \right) } \right\}\). \({C_k}\) denotes the computing capability of the edge server k, \(C_k^r\left( t \right)\) is the computing capability that the edge server k can provide to the user at time slot t, and \(C_k^{used}\left( t \right)\) is the computing capability of the edge server k that has been assigned to other tasks at time slot t, so that \(C_k^r\left( t \right) = {C_k} - C_k^{used}\left( t \right)\).
\(I\left( t \right)\) denotes the interference experienced by a user from other users in the environment transmitting data.
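Since the number of active users varies per slot while a neural network needs a fixed-size input, the state \(Z(t)\) must be padded up to \(N_{\max}\). The sketch below assembles such a vector, using the all-zero convention for inactive users described above; the exact layout is an assumption for illustration.

```python
def build_state(users, servers_avail, interference, n_max):
    """Flatten Z(t) = {U(t), C(t), I(t)} into a fixed-length list.

    users: list of (S_i, D_i, P_max, theta_i, lambda_i) tuples for
    the users present this slot; absent users are zero-padded up to
    N_max, matching the all-zero convention for inactive users.
    """
    state = []
    for i in range(n_max):
        state.extend(users[i] if i < len(users) else (0, 0, 0, 0, 0))
    state.extend(servers_avail)   # C_k^r(t) for each of the K servers
    state.extend(interference)    # I(t)
    return state

# 1 active user out of N_max = 3, K = 2 servers: 3*5 + 2 + 1 = 18 dims
z = build_state([(1e6, 1e9, 0.2, 0.5, 1.0)], [4e9, 6e9], [0.1], n_max=3)
print(len(z))  # 18
```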
3.1.2 Action space
An agent aims to choose the offloading tactics for various users throughout each time slot. The offloading strategy \(A\left( t \right) = \left\{ {X\left( t \right) ,\rho \left( t \right) ,P\left( t \right) ,\xi \left( t \right) } \right\}\) can be divided into four parts:

1.
\(X\left( t \right)\) indicates the user's task offloading server selection. Here, we assume that user i's task can only select one server, and
$$\begin{aligned} X\left( t \right) = \left( {\begin{array}{*{20}{c}} {{x_{1,1}}\left( t \right) }&{}{{x_{1,2}}\left( t \right) }&{} \cdots &{}{{x_{1,K}}\left( t \right) }\\ {{x_{2,1}}\left( t \right) }&{}{{x_{2,2}}\left( t \right) }&{} \cdots &{}{{x_{2,K}}\left( t \right) }\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {{x_{{N_{\max }},1}}\left( t \right) }&{}{{x_{{N_{\max }},2}}\left( t \right) }&{} \cdots &{}{{x_{{N_{\max }},K}}\left( t \right) } \end{array}} \right) , \end{aligned}$$(8)where \(\sum _{j = 1}^K {{x_{i,j}}\left( t \right) } = \left\{ {\begin{array}{ll} {1,}&{}{\text {user }i\text { has a task and the task is offloaded to an edge server}}\\ {0,}&{}{\text {otherwise}} \end{array}} \right.\)

2.
\(\rho \left( t \right)\) indicates the percentage of user tasks offloaded. \(\rho \left( t \right) = \left( {{\rho _1}\left( t \right) ,{\rho _2}\left( t \right) , \cdots ,{\rho _{{N_{\max }}}}\left( t \right) } \right)\), where \({\rho _i}\left( t \right) \in \left[ {0,1} \right]\) indicates the proportion of user i's data and computation tasks uploaded to the edge computing server. \({\rho _i}\left( t \right) = 0\) indicates that user i's tasks are completed entirely locally, and \({\rho _i}\left( t \right) = 1\) indicates that user i's tasks are entirely offloaded to the edge server.

3.
\(P\left( t \right)\) indicates the transmit power of the user. \(P\left( t \right) = \left( {{P_1}\left( t \right) ,{P_2}\left( t \right) , \cdots ,{P_{{N_{\max }}}}\left( t \right) } \right)\), where \({P_i}\left( t \right) \le {P_{i,\max }}\) denotes the task transmit power of user i, and \({P_{i,\max }}\) denotes the maximum transmit power.

4.
The computational capability allocated by the edge server to user tasks can be represented by matrix \(\xi \left( t \right)\), i.e.,
$$\begin{aligned} \xi \left( t \right) = \left( {\begin{array}{cccc} {{\xi _{0,1}}\left( t \right) }&{}{{\xi _{0,2}}\left( t \right) }&{} \cdots &{}{{\xi _{0,K}}\left( t \right) }\\ {{\xi _{1,1}}\left( t \right) }&{}{{\xi _{1,2}}\left( t \right) }&{} \cdots &{}{{\xi _{1,K}}\left( t \right) }\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {{\xi _{{N_{\max }},1}}\left( t \right) }&{}{{\xi _{{N_{\max }},2}}\left( t \right) }&{} \cdots &{}{{\xi _{{N_{\max }},K}}\left( t\right) } \end{array}} \right) . \end{aligned}$$(9)\(\xi \left( t \right)\) must fulfill the following conditions: a) \(C_j^r\left( t \right) = {\xi _{0,j}}\left( t \right) + \sum _{i = 1}^{{N_{\max }}} {{x_{i,j}}\left( t \right) } {\xi _{i,j}}\left( t \right)\), where \({\xi _{0,j}}\left( t \right)\) is critical: it indicates the computational capability reserved by the edge server for future tasks, whose effect can be evaluated by simulation, e.g., by comparing one run with \({\xi _{0,j}}\left( t \right) = 0\) against a normally trained run with reservation. b) \(f_i^{Mec}\left( t \right) = \sum _{j = 1}^{K} {{x_{i,j}}\left( t \right) } {\xi _{i,j}}\left( t \right)\), where \(f_i^{Mec}\left( t \right)\) denotes the computational capability obtained by user i.
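Condition a) above can be made concrete with a small allocation sketch: the server sets aside the reserved share \(\xi_{0,j}(t)\) and splits the remainder among the current users. Splitting proportionally to demand is an assumption for illustration; the text only requires that the reserved share plus the per-user allocations sum to \(C_j^r(t)\).

```python
def allocate(c_avail, reserve_frac, demands):
    """Split a server's available capability C_j^r(t) between a
    reserved share xi_{0,j}(t) for future tasks and the current
    users' allocations, proportionally to their demand weights."""
    reserved = reserve_frac * c_avail      # xi_{0,j}(t)
    pool = c_avail - reserved              # left for current users
    total = sum(demands)
    if total == 0:
        return reserved, [0.0] * len(demands)
    return reserved, [pool * d / total for d in demands]

# Reserve 25% of 10 Gcycles/s; split the rest 1:3 between two users.
reserved, xi = allocate(10e9, reserve_frac=0.25, demands=[1.0, 3.0])
print(reserved, xi)  # 2.5e9 [1.875e9, 5.625e9]
```

Condition a) holds by construction, since `reserved + sum(xi)` always equals `c_avail` when demand is nonzero.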
3.1.3 Reward space
The reward function is pivotal as it delineates the overarching objective of the agent's learning journey. With each action completed, the agent garners a reward from the environment. This reward reflects the benefit of executing said action within the current state and, through sustained interaction, ultimately steers the agent toward refining its strategy to maximize cumulative gain. In light of the optimization challenge proposed, our aim is to minimize latency across the entire MEC system. Reinforcement learning endeavors to realize this by maximizing the sum of discounted rewards over time. As with any learning algorithm, during the training phase, once an action is taken, the corresponding reward is conveyed to the agent at time slot t. Based on the received reward, the agent updates its policy \((\pi )\) toward the optimal policy, that is, the policy that consistently yields high rewards for actions taken across various environmental states. The reward issued to the agent is denoted by \(r:Z \times A \rightarrow R\).
In this study, we design the following reward function
With reward function \({r_i}\left( t \right)\), the system optimizes the objective function to minimize the service delay and increase the proportion of tasks that satisfy the delay constraint.
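One possible shaping consistent with the stated goal, penalizing priority-weighted delay while rewarding tasks that meet their deadline \(\theta_{i,\max}\), is sketched below. This is an assumed form for illustration; the paper's exact reward expression is not reproduced here.

```python
def reward(delay, deadline, priority):
    """Illustrative per-user reward r_i(t): a deadline bonus/penalty
    minus the priority-weighted service delay, so urgent (high
    lambda_i) users are penalized more for the same delay."""
    bonus = 1.0 if delay <= deadline else -1.0
    return bonus - priority * delay

# A task finishing within its deadline earns a positive reward...
print(reward(delay=0.2, deadline=0.5, priority=1.0))  # 0.8
# ...while a late, high-priority task is strongly penalized.
print(reward(delay=1.0, deadline=0.5, priority=2.0))  # -3.0
```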
3.2 Deep reinforcement learning model design
DRL learns through trial-and-error interactions with the environment, where state transitions and rewards are initially unknown. DRL-based server selection relies on gradient-based strategy learning. Within the context of this study, we need to ascertain whether long-term planning can be effectively executed in dynamic environments and how to manage high-dimensional state spaces efficiently. Subsequently, we will outline the resolution to these challenges. For neural network training, we have utilized the DDPG algorithm. This deterministic policy framework does not produce the likelihood of an action; instead, it outputs the specific numerical value of each dimension of the action, thereby obviating the need for action sampling. Given that the training data is time-dependent, it can sometimes lead to slow convergence or even a lack of convergence in neural network training. To counteract this, we implement experience replay, a technique that disrupts temporal correlations to expedite convergence. In reinforcement learning, samples are sequentially correlated, presenting challenges, as neural networks function optimally with samples that are independent and identically distributed. Experience replay addresses the correlation issue inherent in sequential decision-making and enhances sample efficiency. Once the experience pool reaches a predetermined size, the oldest data is typically removed to ensure that the pool remains current. Algorithm 1 presents the proposed computational offloading algorithm for dynamic MEC networks based on deep reinforcement learning. The DDPG network structure is illustrated in Fig. 3.
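The experience replay mechanism just described, a bounded pool that evicts its oldest transitions and serves uniformly random mini-batches to break temporal correlation, can be sketched in a few lines; the class and method names are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded experience pool: oldest transitions are evicted when
    full, and training batches are drawn uniformly at random so that
    consecutive (correlated) samples are not trained on together."""

    def __init__(self, capacity):
        self.pool = deque(maxlen=capacity)  # maxlen drops oldest items

    def store(self, state, action, reward, next_state):
        self.pool.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random mini-batch, capped at the pool's current size
        return random.sample(list(self.pool), min(batch_size, len(self.pool)))

buf = ReplayBuffer(capacity=2)
for t in range(3):                    # the third insert evicts the first
    buf.store(t, 0, 0.0, t + 1)
print(len(buf.pool), buf.pool[0][0])  # 2 1
```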
The DDPG algorithm comprises four neural networks: the Actor network \({\mu _\theta }\left( t \right)\), the Critic network \({\mu _Q }\left( t \right)\), the Target Actor network \({\mu _{{\theta ^\prime }}}\left( t \right)\), and the Target Critic network \({\mu _{{Q ^\prime }}}\left( t \right)\). The workflow of the DDPG algorithm operates as follows:

1. Initialization: The Actor and Critic networks are initialized along with their respective target networks.
2. Sampling: The Actor network generates actions for a given environmental state, which are then executed in the environment to observe rewards and subsequent states.
3. Storage: The experiences, consisting of states, actions, and rewards, are stored in a replay buffer for future learning.
4. Training the Critic Network: A mini-batch of experiences is randomly sampled from the replay buffer. The Critic network evaluates these experiences, the Temporal Difference (TD) error is computed, and the network's parameters are updated through backpropagation to minimize this error.
5. Training the Actor Network: The gradient of the error calculated by the Critic network is used to update the parameters of the Actor network via backpropagation.
6. Updating the Target Networks: The parameters of the target networks are gradually adjusted toward the parameters of their respective current networks, using a soft update approach.
7. Loop: Steps 2-6 are repeated, continuously refining the network parameters until the algorithm converges.
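The seven steps above can be sketched as a self-contained training loop. To keep the example runnable, the Actor and Critic below are linear functions of a scalar state rather than neural networks, and the environment is a toy scalar recursion; every name, constant, and the environment itself are illustrative assumptions, not the paper's MEC setup.

```python
import numpy as np

# Structural sketch of the DDPG workflow (steps 1-7) on a toy 1-D problem.
rng = np.random.default_rng(0)
theta = np.zeros(2)                        # Actor parameters: a = theta[0]*s + theta[1]
q_w = np.zeros(3)                          # Critic parameters: Q = q_w . (s, a, 1)
theta_t, q_w_t = theta.copy(), q_w.copy()  # 1. Initialization: targets start as copies
replay = []                                # experience replay buffer
tau, gamma, lr = 0.01, 0.9, 1e-3

def actor(s, p):
    return p[0] * s + p[1]

def critic(s, a, p):
    return p[0] * s + p[1] * a + p[2]

s = 0.0
for step in range(200):
    a = actor(s, theta) + rng.normal(0.0, 0.1)     # 2. Sampling: action + exploration noise
    s2 = float(np.clip(0.5 * s + a, -2.0, 2.0))    # toy environment transition
    r = -abs(s2 - 1.0)                             # toy reward: drive the state toward 1
    replay.append((s, a, r, s2))                   # 3. Storage
    if len(replay) >= 32:
        idx = rng.integers(0, len(replay), size=32)            # random mini-batch
        for bs, ba, br, bs2 in (replay[i] for i in idx):
            y = br + gamma * critic(bs2, actor(bs2, theta_t), q_w_t)  # 4. TD target
            td = critic(bs, ba, q_w) - y                              #    TD error
            q_w = q_w - lr * td * np.array([bs, ba, 1.0])             #    Critic step
            theta = theta + lr * q_w[1] * np.array([bs, 1.0])  # 5. Actor ascends Q via dQ/da
        theta_t = tau * theta + (1.0 - tau) * theta_t          # 6. soft target updates
        q_w_t = tau * q_w + (1.0 - tau) * q_w_t
    s = s2                                                     # 7. Loop
```

A real implementation would replace the linear functions with the four neural networks of Fig. 3 and backpropagate the same TD and policy gradients.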
DDPG, being a deterministic policy-based approach, requires sampling fewer data points, which makes the algorithm efficient; however, it may struggle to generalize to unseen actions. To restore the action exploration ability sacrificed by the deterministic policy, a random noise N is added to the action A selected by the policy network, enhancing generalization. Ultimately, the action A that interacts with the environment is \(A\left( t \right) = {\mu _\theta }\left( t \right) + n\left( t \right)\), where \(n\left( t \right)\) is Gaussian white noise.
Next are the loss functions of DDPG. For the current Critic network, the loss function is the mean square error of the TD target, i.e., \(L\left( Q \right) = \frac{1}{M}\sum\nolimits_{i = 1}^M {{{\left( {{y_i} - Q\left( {{s_i},{a_i}} \right)} \right)}^2}}\), where \({y_i} = {r_i} + \gamma Q^\prime \left( {{s_{i + 1}},{\mu _{\theta ^\prime }}\left( {{s_{i + 1}}} \right)} \right)\) is computed with the target networks.
For the current Actor network, the loss function is the negative expected value of the Critic's output, \(J\left( \theta \right) = - \frac{1}{M}\sum\nolimits_{i = 1}^M {Q\left( {{s_i},{\mu _\theta }\left( {{s_i}} \right)} \right)}\).
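Numerically, the two losses reduce to a mean squared TD error for the Critic and a negative mean Q-value for the Actor. The sketch below uses placeholder numbers rather than outputs of trained networks.

```python
import numpy as np

# Illustration of the two DDPG losses over a mini-batch of size 3: the Critic
# minimizes the mean squared TD error against the target y_i computed with the
# target networks, and the Actor minimizes the negative mean Q-value. All
# Q-values below are placeholder numbers, not outputs of trained networks.
gamma = 0.95
rewards = np.array([1.0, 0.5, -0.2])
q_next_target = np.array([2.0, 1.5, 0.8])   # Q'(s', mu'(s')) from the target nets
q_current = np.array([2.5, 1.7, 0.4])       # Q(s, a) from the current Critic

y = rewards + gamma * q_next_target         # TD targets y_i
critic_loss = np.mean((y - q_current) ** 2) # mean squared TD error
actor_loss = -np.mean(q_current)            # negative mean Q-value
```

Minimizing `critic_loss` pulls the Critic toward the bootstrapped targets, while minimizing `actor_loss` pushes the Actor toward actions the Critic values highly.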
Building upon the DQN algorithm, the DDPG algorithm introduces three significant enhancements:
First, DDPG improves the stability of learning by adopting the dual neural network architecture from DQN. This architecture involves two sets of neural networks: the primary networks used for evaluation and the target networks whose parameters are updated only occasionally. DDPG distinguishes itself by employing a soft update method for the target networks, providing a more stable learning process.
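The soft update mentioned above can be written in a few lines; the value \(\tau = 0.005\) is a common choice assumed here for illustration.

```python
import numpy as np

# Soft (Polyak) target update: the target parameters move a small step tau
# toward the current parameters at every iteration, instead of being copied
# wholesale at fixed intervals as in DQN.
def soft_update(target: np.ndarray, current: np.ndarray, tau: float = 0.005) -> np.ndarray:
    return tau * current + (1.0 - tau) * target

target = np.zeros(4)
current = np.ones(4)
for _ in range(1000):                  # repeated updates converge toward `current`
    target = soft_update(target, current)
```

Because the target networks trail the current ones smoothly, the TD targets change slowly, which is what stabilizes learning.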
Second, to address the issue of correlated and nonuniformly distributed samples, DDPG utilizes the experience replay mechanism, a concept borrowed from DQN. This mechanism preserves the data generated during the agent's interaction with the environment in a structured memory known as the experience replay buffer. During the learning phase, the algorithm samples a batch of experiences at random from the buffer to train the model. This method ensures a diversified learning experience, which is essential for the robust development of the policy.
The third enhancement addresses the exploration-exploitation dilemma, a fundamental challenge in reinforcement learning where the agent must balance exploring new possibilities with leveraging existing knowledge. DDPG introduces exploration noise to this end. By adding stochastic noise, which often follows a Gaussian or uniform distribution, to the selected actions, the algorithm equips the agent with better exploration capabilities. This noise enables the agent to investigate uncharted areas of the state and action space more effectively, facilitating the discovery of optimal strategies.
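A minimal sketch of this exploration noise follows: a zero-mean Gaussian perturbation is added to the deterministic action and the result is clipped back into the valid action range. The bounds and standard deviation are illustrative assumptions.

```python
import numpy as np

# Gaussian exploration noise for a deterministic policy: perturb the chosen
# action, then clip it back into the admissible action interval [low, high].
def noisy_action(action: np.ndarray, sigma: float, rng, low=-1.0, high=1.0):
    return np.clip(action + rng.normal(0.0, sigma, size=action.shape), low, high)

rng = np.random.default_rng(42)
a = np.array([0.3, -0.8])                       # deterministic action from the Actor
explored = np.array([noisy_action(a, sigma=0.2, rng=rng) for _ in range(1000)])
```

Over many time slots the perturbed actions cover a neighborhood of the deterministic action, which is what lets the agent discover better offloading choices than the current policy suggests.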
4 Simulations and discussions
4.1 Simulation setup
In this section, extensive simulations are conducted to evaluate the performance of the proposed DDPG-based algorithm and to compare it with benchmark algorithms.
A small cell covering a \(0.3\times 0.3\) km area in a 5G mobile environment is considered, in which K APs equipped with MEC servers serve N mobile users with computation tasks randomly dispersed within the APs' coverage area. Users have heterogeneous computational capabilities, with computing power uniformly distributed between 0.5 and 2 GHz. The MEC system can leverage the DSA technique to allocate channel resources according to the demand of the terminals. Other simulation parameters are listed in Table 1.
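The user placement and capability model just described can be sketched as follows; the seed and the choice of N = 20 are illustrative.

```python
import numpy as np

# Sketch of the simulation setup: N users are scattered uniformly at random
# over the 0.3 km x 0.3 km cell, and each user's CPU frequency is drawn
# uniformly between 0.5 and 2 GHz to model heterogeneous computing power.
rng = np.random.default_rng(7)
N = 20
positions_km = rng.uniform(0.0, 0.3, size=(N, 2))   # (x, y) user coordinates in km
cpu_ghz = rng.uniform(0.5, 2.0, size=N)             # per-user computing capability
```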
4.2 Performance comparison
Figure 4 illustrates the convergence of the proposed DDPG-based learning method when the system accommodates 20 user terminals. Initially, the cumulative reward fluctuates little; this is attributed to the user's lack of environmental knowledge at the outset, which results in nearly random action selection. As the user accumulates sufficient samples over time, these samples are used to train the network. Overall, the DDPG-based method demonstrates robust performance, stabilizing after approximately 50 training episodes. It is evident that with an increasing number of training episodes, the system's cumulative reward rises swiftly, enabling the effective learning of computational offloading strategies through ongoing interaction.
For a comparative analysis of performance, we introduce four benchmark algorithms: (a) a brute-force search to ascertain an approximate optimal solution (denoted as "Exhaustion"); (b) a strategy that prioritizes offloading tasks to MEC servers, distributing all communication and computation resources equally among users (denoted as "Offloading"); (c) a user-centric approach that favors local task execution with maximum tolerated latency (denoted as "Local"); (d) an optimization of offloading decisions that does not factor in the optimization of resource allocation (denoted as "Offloading Decision").
Figure 5 compares the proposed algorithm's performance against these benchmarks in terms of average latency as the number of users increases. The latency of all algorithms escalates as users are added. The exhaustive method serves as the benchmark for peak performance, and the proposed DDPG-based method delivers performance closely aligned with it. Notably, with eight users, the proposed algorithm reduces average latency by 20%, 33% and 55% compared with the other three methods, respectively. Furthermore, the average latency of the DDPG algorithm remains lower than those of the benchmark algorithms, confirming the effectiveness of the proposed strategy.
The various tasks are categorized into three priorities based on the value of the priority coefficient \({\lambda _i}\): \(0.75 < {\lambda _i} \le 1\) for high priority, \(0.4 < {\lambda _i} \le 0.75\) for medium priority and \(0 < {\lambda _i} \le 0.4\) for low priority. The number of users is set to \(N = 20\) and the input data size is fixed at an average of 200 kB.
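This banding can be sketched as a small classifier. Note the cut points for the medium and low bands are assumed here to partition (0, 1] cleanly, since only the high-priority band is unambiguous in the text.

```python
# Priority banding sketch, assuming the three bands partition (0, 1]:
# low (0, 0.4], medium (0.4, 0.75], high (0.75, 1]. The 0.75 threshold comes
# from the text; the 0.4 cut point is an assumption for illustration.
def priority(lam: float) -> str:
    if not 0.0 < lam <= 1.0:
        raise ValueError("lambda_i must lie in (0, 1]")
    if lam > 0.75:
        return "high"
    if lam > 0.4:
        return "medium"
    return "low"

labels = [priority(l) for l in (0.9, 0.6, 0.2)]
```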
Figure 6 depicts the latency of three priority tasks under varying computational task loads. As the computational load intensifies, the latency for all priority levels increases, with the high priority tasks experiencing the least latency and the low priority tasks the most. The average system latency exceeds that of high-priority tasks, indicating that reducing latency for high-priority tasks incurs increased latency for lower-priority ones. Figure 7 presents the average task utility for the three levels of prioritized tasks under various computational burdens. Our proposed approach not only ensures reduced system latency but also stratifies task priority effectively, allowing urgent tasks to be completed more swiftly by users with pressing needs.
5 Conclusions
In this study, we address the server selection problem within dynamic time slot schemes in MEC. To tackle the NP-hard challenges stemming from dynamic factors, we model the ongoing server selection issue as an MDP and introduce an algorithm based on DRL. Our DRL-based server selection algorithm accounts for user states, inter-user interference, and the processing capabilities of edge servers. We incorporate historical data and the dynamic nature of these elements through neural network encoding. Our simulation results indicate that the DDPG algorithm developed in this study consistently outperforms established benchmarks by delivering the lowest average latency.
Availability of data and materials
Not applicable.
Abbreviations
MEC: Mobile edge computing
MDP: Markov Decision Process
DRL: Deep Reinforcement Learning
5G: Fifth generation
BS: Base station
AR: Augmented reality
UAVs: Unmanned aerial vehicles
MINLP: Mixed integer nonlinear programming
DDPG: Deep Deterministic Policy Gradient
AP: Access point
References
J. Liu, L. Zhong, J. Wickramasuriya, V. Vasudevan, uWave: Accelerometer-based personalized gesture recognition and its applications. Pervasive Mob. Comput. 5(6), 657–675 (2009)
A. Al-Shuwaili, O. Simeone, Energy-efficient resource allocation for mobile edge computing-based augmented reality applications. IEEE Wirel. Commun. Lett. 6(3), 398–401 (2017)
W. Shi, J. Cao, Q. Zhang, Y. Li, L. Xu, Edge computing: vision and challenges. IEEE Internet Things J. 3(5), 637–646 (2016)
V. Farhadi et al., Service placement and request scheduling for data-intensive applications in edge clouds. In: IEEE INFOCOM 2019 – IEEE Conference on Computer Communications (2019)
L. Zhao, W. Sun, Y. Shi, J. Liu, Optimal placement of cloudlets for access delay minimization in SDN-based internet of things networks. IEEE Internet Things J. 5(2), 1334–1344 (2018)
B. Shen, X. Xu, F. Dar, L. Qi, X. Zhang, W. Dou, Dynamic task offloading with minority game for internet of vehicles in cloud-edge computing. In: 2020 IEEE International Conference on Web Services (ICWS) (2020)
H. Baraki, A. Jahl, S. Jakob, C. Schwarzbach, M. Fax, K. Geihs, Optimizing applications for mobile cloud computing through MOCCAA. J. Grid Comput. 17(4), 651–676 (2019)
Y.C. Hu, M. Patel, D. Sabella, N. Sprecher, V. Young, Mobile edge computing – A key technology towards 5G. ETSI White Paper 11(11), 1–16 (2015)
Y. Mao, C. You, J. Zhang, K. Huang, K.B. Letaief, A survey on mobile edge computing: the communication perspective. IEEE Commun. Surv. Tutor. 19(4), 2322–2358 (2017)
Y. Zhang et al., Edge intelligence for plug-in electrical vehicle charging service. IEEE Netw. 35(3), 81–87 (2021)
V. Varadharajan, S. Mantri, B. Shah et al., Emerging edge computing applications. In: IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT) (2022), pp. 1–4
Z. Yang, S. Bi, Y.J.A. Zhang, Dynamic trajectory and offloading control of UAV-enabled MEC under user mobility. In: 2021 IEEE International Conference on Communications Workshops (ICC Workshops) (2021)
S. Li, X. Hu, Y. Du, Deep reinforcement learning and game theory for computation offloading in dynamic edge computing markets. IEEE Access 9, 121456–121466 (2021)
L. Zhang et al., Task offloading and trajectory control for UAV-assisted mobile edge computing using deep reinforcement learning. IEEE Access 9, 53708–53719 (2021)
P.Q. Huang, Y. Wang, K. Wang, Z.Z. Liu, A bilevel optimization approach for joint offloading decision and resource allocation in cooperative mobile edge computing. IEEE Trans. Cybern. 50(10), 4228–4241 (2020)
T. Nisha, D.T. Nguyen, V.K. Bhargava, A bilevel programming framework for joint edge resource management and pricing. IEEE Internet Things J. 9(18), 17280–17291 (2022)
Y. Liu, J. Yan, X. Zhao, Deep reinforcement learning based latency minimization for mobile edge computing with virtualization in maritime UAV communication network. IEEE Trans. Veh. Technol. 71(4), 4225–4236 (2022)
X. Xu et al., Game theory for distributed IoV task offloading with fuzzy neural network in edge computing. IEEE Trans. Fuzzy Syst. 30(11), 4593–4604 (2022)
R. Zhao, J. Xia, Z. Zhao, S. Lai, L. Fan, D. Li, Green MEC networks design under UAV attack: a deep reinforcement learning approach. IEEE Trans. Green Commun. Netw. 5(3), 1248–1258 (2021)
S. Bi, Y.J. Zhang, Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading. IEEE Trans. Wirel. Commun. 17(6), 4177–4190 (2018)
F. Wang, J. Xu, X. Wang, S. Cui, Joint offloading and computing optimization in wireless powered mobile-edge computing systems. IEEE Trans. Wirel. Commun. 17(3), 1784–1797 (2018)
C. You, K. Huang, H. Chae, Energy efficient mobile cloud computing powered by wireless energy transfer. IEEE J. Sel. Areas Commun. 34(5), 1757–1771 (2016)
W. Zhang, Y. Wen, K. Guan, D. Kilper, H. Luo, D.O. Wu, Energy-optimal mobile cloud computing under stochastic wireless channel. IEEE Trans. Wirel. Commun. 12(9), 4569–4581 (2013)
M.H. Chen, B. Liang, M. Dong, Joint offloading decision and resource allocation for multiuser multitask mobile cloud. In: 2016 IEEE International Conference on Communications (ICC) (2016), pp. 1–6
T.Q. Thinh, J. Tang, Q.D. La, T.Q.S. Quek, Offloading in mobile edge computing: task allocation and computational frequency scaling. IEEE Trans. Commun. 65(8), 3571–3584 (2017)
C. You, K. Huang, H. Chae, B.H. Kim, Energy-efficient resource allocation for mobile-edge computation offloading. IEEE Trans. Wirel. Commun. 16(3), 1397–1411 (2017)
Y. Wang, M. Sheng, X. Wang, L. Wang, J. Li, Mobile-edge computing: partial computation offloading using dynamic voltage scaling. IEEE Trans. Commun. 64(10), 4268–4282 (2016)
T. Liu, S. Ni, X. Li, Y. Zhu, L. Kong, Y. Yang, Deep reinforcement learning based approach for online service placement and computation resource allocation in edge computing. IEEE Trans. Mob. Comput. 22(7), 3870–3881 (2023)
S. Joo, H. Kang, J. Kang, CoSMoS: Cooperative sky-ground mobile edge computing system. IEEE Trans. Veh. Technol. 70(8), 8373–8377 (2021)
H. Xing, L. Liu, J. Xu, A. Nallanathan, Joint task assignment and resource allocation for D2D-enabled mobile-edge computing. IEEE Trans. Commun. 67(6), 4193–4207 (2019)
X. Zhu, Y. Luo, A. Liu, N.N. Xiong, M. Dong, S. Zhang, A deep reinforcement learning-based resource management game in vehicular edge computing. IEEE Trans. Intell. Transp. Syst. 23(3), 2422–2433 (2022)
L. He, J. Zhao, X. Sun, D. Zhang, Dynamic task offloading for mobile edge computing in urban rail transit. In: 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP) (2021)
N. Irtija, I. Anagnostopoulos, G. Zervakis, E.E. Tsiropoulou, H. Amrouch, J. Henkel, Energy efficient edge computing enabled by satisfaction games and approximate computing. IEEE Trans. Green Commun. Netw. 6(1), 281–294 (2022)
Y. Zou, F. Shen, F. Yan, L. Tang, Task-oriented resource allocation for mobile edge computing with multi-agent reinforcement learning. In: 2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall) (2021)
Q. Li, X. Ma, A. Zhou, X. Luo, F. Yang, S. Wang, User-oriented edge node grouping in mobile edge computing. IEEE Trans. Mob. Comput. 22(6), 3691–3705 (2023)
Y. Zhang, X. Dong, Y. Zhao, Decentralized computation offloading over wireless-powered mobile-edge computing networks. In: IEEE International Conference on Artificial Intelligence and Information Systems (ICAIIS) (2020), pp. 137–140
P. Zhou, B. Yang, C. Chen, Joint computation offloading and resource allocation for NOMA-enabled industrial internet of things. In: 39th Chinese Control Conference (CCC) (2020), pp. 5241–5246
Z. Song, Y. Liu, X. Sun, Joint task offloading and resource allocation for NOMA-enabled multi-access mobile edge computing. IEEE Trans. Commun. 69(3), 1548–1564 (2021)
Z. Wan, D. Xu, D. Xu et al., Joint computation offloading and resource allocation for NOMA-based multi-access mobile edge computing systems. Comput. Netw. 196, 108256 (2021)
N.C. Luong, D.T. Hoang, S. Gong, D. Niyato, I.K. Dong, Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 21(4), 3133–3174 (2019)
M. Chen, Y. Hao, Task offloading for mobile edge computing in software defined ultra-dense network. IEEE J. Sel. Areas Commun. 36(3), 587–597 (2018)
H. Zhou, K. Jiang, X. Liu, X. Li, V.C.M. Leung, Deep reinforcement learning for energy-efficient computation offloading in mobile-edge computing. IEEE Internet Things J. 9(2), 1517–1530 (2022)
Acknowledgements
This research was funded by the Fujian Provincial Natural Science Fund under Grant (2023J01967)
Funding
This research was funded by the Fujian Provincial Natural Science Fund under Grant (2023J01967).
Author information
Authors and Affiliations
Contributions
The major writer of this study is Y.F., who proposed the main idea, carried out the simulations, and analyzed the results. X.C. assisted with the review of this study. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fan, Y., Cai, X. A deep reinforcement approach for computation offloading in MEC dynamic networks. EURASIP J. Adv. Signal Process. 2024, 48 (2024). https://doi.org/10.1186/s13634-024-01142-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13634-024-01142-2