This section introduces the proposed DDQN algorithm-based computing offloading policy, the pre-processing stage, and the DQN algorithm-based computational resource allocation scheme.

### 4.1 DDQN algorithm-based computing offloading

In the first phase, the utility obtained by each UE in the FAP processing mode cannot be computed directly, because the computational resource of each FAP has not yet been allocated to UEs when the offloading decisions are made. Hence, in the first phase we treat the computational resource of each FAP as a single pool that can be allocated to the requesting UE; after the offloading decision has been made for each UE, we further optimize the computational resource allocation in each FAP. Specifically, a DDQN algorithm is adopted to find the most appropriate offloading mode for each UE. Once the offloading mode has been chosen, we decide which tasks should be sent to the cloud center and allocate the computational resource of the FAPs optimally.

#### 4.1.1 Markov decision process

The DDQN algorithm is based on DRL. As a model-free approach, DRL can address complicated system settings by dynamically interacting with an unknown environment without any prior knowledge [25], and it can also handle potentially large state spaces [26]. In our considered F-RAN system, the problem of making offloading decisions for UEs is formulated as a finite Markov Decision Process (MDP) [27]. We assume that each training epoch is divided into *T* steps indexed by \(t=1,2,\ldots ,T\), where *T* denotes the number of UEs that need to offload tasks. Combining the considered F-RAN system with the DRL algorithm, the four essentials of RL in each step *t*, namely Agent, Environment & State, Action Space, and Immediate Reward, are defined as follows.

**Agent:** In RL, the agent is the learner and decision-maker. In our considered F-RAN system, the BS is selected as the agent of the DDQN algorithm.

**Environment & State:** The environment in RL is defined as the set of all possible states, and the essence of RL is to perform actions that cause state transitions [28]. Therefore, we set a matrix \(\varvec{S}\) as the state, which has the same shape as the matrix \(\varvec{P}\). Each entry \({{s}_{ij}}\) of \(\varvec{S}\) takes the value 0 or 1: \({{s}_{ij}}=1\) indicates that the agent selects FAP *i* (or DCN *i*) for UE *j*, and \({{s}_{ij}}=0\) otherwise. At step \(t=0\) in each training epoch, we initialize \(\varvec{S}\) as an all-zero matrix; the agent then executes actions to interact with the environment, which changes \(\varvec{S}\).

**Action Space:** The BS makes the offloading decision for each requesting UE according to the network topology matrix \(\varvec{P}\), and the optimization objective is to find the optimal offloading mode for each UE. Thereby, in the proposed DDQN algorithm, we use \({{a}_{t}}\in A\) to denote the action in step *t*, where \(A=\left\{ {{\partial }_{1}},{{\partial }_{2}},...,{{\partial }_{M}},{{\beta }_{1}},{{\beta }_{2}},...,{{\beta }_{K}} \right\}\).

**Immediate Reward:** The reward function should be related to the objective function [29]. Accordingly, we define the immediate reward \({{r}_{t}}\) in each step *t* in two cases. If the constraints in Eq. (18) are all satisfied, the agent obtains a positive immediate reward \({{r}_{t}}\) equal to the utility obtained by the *t*-th UE. Otherwise, the reward obtained by the agent is zero. The reward is also set to zero when \(\exists j\in N,\sum _{i=0}^{M+K+1}{{{p}_{ij}}=0}\), which means that UE *j* cannot be connected to any FAP or DCN. Therefore, a zero reward means that the UE should carry out local processing. We define the reward function at step *t* as

$$\begin{aligned} {{r}_{t}}={\left\{ \begin{array}{ll} \text{the utility of the }t\text{-th UE}, &{} \text{if (18a)-(18e) are all satisfied}, \\ 0, &{} \text{if (18a)-(18e) are not all satisfied or }\exists j\in N,\ \sum \limits _{i=0}^{M+K+1}{{{p}_{ij}}}=0. \end{array}\right. } \end{aligned}$$

(19)

At the end of each training epoch, the accumulated reward represents the total utility obtained by the requesting UEs.
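The state update and the reward rule of Eq. (19) can be illustrated with a minimal sketch. The names `reset_state` and `step` are hypothetical, and the arguments `utility` and `feasible` are assumed stand-ins for the utility of the current UE and for the check of constraints (18a)-(18e), neither of which is specified in this section.

```python
import numpy as np

def reset_state(num_nodes: int, num_ues: int) -> np.ndarray:
    """Return the all-zero state matrix S used at the start of an epoch."""
    return np.zeros((num_nodes, num_ues), dtype=int)

def step(state, action, ue, utility, feasible, topology):
    """Apply one offloading action for UE `ue` and return (next_state, reward).

    `action` is the row index of the selected FAP or DCN.
    `utility` and `feasible` are assumed inputs: the utility of this UE and
    whether constraints (18a)-(18e) all hold for the chosen action.
    `topology` plays the role of the matrix P.
    """
    next_state = state.copy()
    # A UE whose column in P is all zero cannot reach any FAP or DCN.
    reachable = topology[:, ue].sum() > 0
    if reachable:
        next_state[action, ue] = 1
    # Eq. (19): positive reward only if the UE is reachable and feasible.
    reward = float(utility) if (reachable and feasible) else 0.0
    return next_state, reward
```

Summing the rewards returned by `step` over one epoch gives the accumulated reward, that is, the total utility of the requesting UEs under these assumptions.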

#### 4.1.2 The pre-processing stage

However, the proposed centralized DDQN algorithm in the BS has a high complexity: the dimension of the state space grows dramatically as the number of requesting UEs increases, which raises the complexity of the DDQN algorithm and reduces the efficiency of the network training. Thereby, a pre-processing phase is adopted to reduce the dimension of the state space and improve the total utility obtained by all UEs. Specifically, assume that each UE, DCN, and FAP has cached the processing results of some tasks according to the optimal caching matrix \({{C}_{(M+K+N)\times N}}\) obtained in our previous research [20]. For our considered F-RAN system, we extend the matrix \({{P}_{(M+K)\times N}}\) to the same dimension as *C*, filling the added entries with “1”, and take its element-wise product with *C* to obtain the matrix \({P}'=P\bullet C\). In this way, each task has its own identity that distinguishes it from the others in \({P}'\). Accordingly, when a UE has a task to be processed, it first checks whether the task result is stored in its local cache. If the result is not found locally, the identification of the task is transmitted to the BS, which searches the matrix \({P}'\) and selects the closest route to deliver the result to the requesting UE. If the result can be obtained directly in the pre-processing stage, the maximum utility that UE *n* can obtain is expressed as

$$\begin{aligned} {{U}_{n}}=\rho _{n}^{t}T_{n}^{l}+\rho _{n}^{e}E_{n}^{l}. \end{aligned}$$

(20)

However, if the result cannot be found in the pre-processing stage, the offloading procedure is adopted. Since the BS is equipped with a powerful computing server, the search for and delivery of the task result can be completed so quickly that the transmission delay can be ignored. Therefore, UEs that find their task results during the pre-processing phase no longer need to participate in task offloading, which decreases the complexity of the DDQN algorithm. The specific procedure of the pre-processing phase is shown in Algorithm 1.
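The construction of \({P}'\) can be sketched in a few lines of NumPy. This is only an illustration, not Algorithm 1: it assumes that *C* is a 0/1 matrix whose entry in row *i* and column *j* marks that node *i* holds the result needed by UE *j*, and the function names `build_lookup` and `cached_sources` are hypothetical.

```python
import numpy as np

def build_lookup(P: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Pad P with rows of ones up to the shape of C, then multiply element-wise."""
    extra_rows = C.shape[0] - P.shape[0]
    if extra_rows < 0 or P.shape[1] != C.shape[1]:
        raise ValueError("C needs the same columns as P and at least as many rows")
    padding = np.ones((extra_rows, P.shape[1]), dtype=P.dtype)
    padded = np.vstack([P, padding])
    return padded * C  # element-wise product, P' in the text

def cached_sources(P_prime: np.ndarray, ue: int) -> np.ndarray:
    """Row indices with a non-zero entry in column `ue` (assumed 0/1 semantics)."""
    return np.flatnonzero(P_prime[:, ue])
```

If `cached_sources` returns an empty array for a UE, that UE would fall through to the offloading procedure; how the closest route is chosen among several sources is not modeled here.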

#### 4.1.3 DDQN algorithm

The DDQN algorithm-based offloading scheme is proposed to select the optimal offloading mode for the UEs that still need to offload after the state space has been reduced. Specifically, the DDQN algorithm is a typical DRL algorithm that uses a deep neural network to approximate the state-action *Q* value, with the aim of maximizing the expected accumulated discounted reward and obtaining the optimal action [30]. The *Q* function is expressed as formula (21)

$$\begin{aligned} Q(s,a)=E\left[ \sum \limits _{i=0}^{T}{{{\gamma }^{i}}{{R}_{t+i}}}\left| {{s}_{t}}=s,{{a}_{t}}=a \right. \right] \end{aligned}$$

(21)

where

$$\begin{aligned} {{R}_{t}}={{r}_{t}}+{{r}_{t+1}}+\cdots +{{r}_{T}}. \end{aligned}$$

(22)

Here \(\gamma\) is a discount factor between 0 and 1 that controls the effect of future rewards on the current step: the larger \(\gamma\) is, the greater this effect.
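As a small worked illustration of Eqs. (21) and (22), the sketch below evaluates the quantity inside the expectation for a single reward sequence. The function names are hypothetical, and the sum is truncated at the end of the epoch, which is an assumption because the text does not define \({{R}_{t+i}}\) for \(t+i>T\).

```python
def tail_sums(rewards):
    """R_t = r_t + r_{t+1} + ... + r_T for every step t, as in Eq. (22)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def discounted_sum(rewards, gamma, t=0):
    """One-trajectory sample of sum_i gamma^i * R_{t+i} from Eq. (21).

    Assumption: terms with t + i beyond the last step are dropped.
    """
    R = tail_sums(rewards)
    return sum(gamma ** i * R[t + i] for i in range(len(R) - t))
```

For rewards `[1, 2, 3]` the tail sums are `[6, 5, 3]`, so with \(\gamma =0.5\) the value at \(t=0\) is \(6+0.5\cdot 5+0.25\cdot 3=9.25\).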

The model and architecture of the DDQN algorithm we designed are shown in Fig. 2, where we use a single step *t* of a training epoch as an instance to introduce our network model. In each step *t*, the input of the DDQN network is the current state \({{s}_{t}}\), and the output is the *Q* value of each possible action in state \({{s}_{t}}\), denoted \(Q({{s}_{t}},{{a}_{t}})\). The agent selects an action according to the \(\varepsilon\)-Greedy policy and then performs it: with probability \(\varepsilon\) an action is selected at random, and with probability \(1-\varepsilon\) the action with the maximum value of \(Q({{s}_{t}},{{a}_{t}})\) is selected. The advantage of the \(\varepsilon\)-Greedy policy is that it makes the agent explore unknown actions and states in each step, which helps the algorithm avoid falling into a locally optimal solution. After the selected action is executed, the state transfers to the next state \({{s}_{t+1}}\). Meanwhile, the agent receives an immediate reward \({{r}_{t}}\), and training continues at the next step \(t+1\) until the end of the training epoch. The objective of the DDQN network training is to obtain a series of actions that maximizes the accumulated discounted reward; that is, the BS aims to achieve the maximum total utility for all UEs in the considered F-RAN model. To improve the training performance, the DDQN algorithm splits the output \(Q({{s}_{t}},{{a}_{t}})\) into two parts, the State Value Function \(V({{s}_{t}})\) and the Action Advantage Function \(A({{s}_{t}},{{a}_{t}})\), expressed as

$$\begin{aligned} Q({{s}_{t}},{{a}_{t}};\omega ,\varphi )=V({{s}_{t}};\omega )+A({{s}_{t}},{{a}_{t}};\varphi ) \end{aligned}$$

(23)

where \(\omega\) and \(\varphi\) are the network parameters for \(V({{s}_{t}})\) and \(A({{s}_{t}},{{a}_{t}})\), respectively. Specifically, \(V({{s}_{t}})\) stands for the expected accumulated reward in state \({{s}_{t}}\), and \(A({{s}_{t}},{{a}_{t}})\) indicates how much better action \({{a}_{t}}\) is than the average level in state \({{s}_{t}}\), as given in formulas (24) and (25).

$$\begin{aligned} V(s)= & {} E\left[ \sum \limits _{i=0}^{T}{{{\gamma }^{i}}{{R}_{t+i}}}\left| {{s}_{t}}=s \right. \right] \end{aligned}$$

(24)

$$\begin{aligned} A({{s}_{t}},{{a}_{t}})\triangleq & {} Q({{s}_{t}},{{a}_{t}})-V({{s}_{t}}). \end{aligned}$$

(25)

According to Ref. [31], formula (23) can be reformulated as

$$\begin{aligned} Q({{s}_{t}},{{a}_{t}};\omega ,\varphi )=V({{s}_{t}};\omega )+\left( A({{s}_{t}},{{a}_{t}};\varphi )-\frac{1}{\left| A \right| }\sum \limits _{{a}'}{A({{s}_{t}},{a}';\varphi )} \right) . \end{aligned}$$

(26)
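The mean-subtracted combination in formula (26), together with the \(\varepsilon\)-Greedy selection described above, can be sketched as follows. The value and advantage inputs are assumed to come from the two output branches of the network; the function names and the use of a NumPy random generator are illustrative choices, not part of the proposed scheme.

```python
import numpy as np

def dueling_q(value: float, advantages: np.ndarray) -> np.ndarray:
    """Combine V(s) and A(s, .) as Q = V + (A - mean(A)), cf. formula (26)."""
    return value + (advantages - advantages.mean())

def epsilon_greedy(q_values: np.ndarray, epsilon: float,
                   rng: np.random.Generator) -> int:
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Subtracting the mean advantage makes the average of the resulting *Q* values equal to \(V({{s}_{t}})\), which is what lets the two branches be identified separately.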

Furthermore, according to the training procedure of DRL [32], we build the loss function of the DDQN algorithm as

$$\begin{aligned} L={{({{r}_{t}}+\gamma \underset{a}{\mathop {\max }}\,{\hat{Q}}({{s}_{t+1}},a,{{\omega }^{-}},{{\varphi }^{-}})-Q({{s}_{t}},{{a}_{t}};\omega ,\varphi ))}^{2}} \end{aligned}$$

(27)

where \({{r}_{t}}+\gamma \underset{a}{\mathop {\max }}\,{\hat{Q}}({{s}_{t+1}},a,{{\omega }^{-}},{{\varphi }^{-}})\) represents the target value computed by the target network and \(Q({{s}_{t}},{{a}_{t}};\omega ,\varphi )\) represents the value given by the predict network. These two networks have the same structure but different parameters, where the parameters of the former are copied from the latter every *I* steps. During each training epoch of the DDQN network, the gradient descent algorithm is utilized to minimize the loss function and find the optimal parameters of the predict network, which is then used to evaluate the *Q* value of each chosen action [33]. In the DDQN algorithm, an experience pool is introduced to ensure the stability of the network training: the latest interaction data \(({{s}_{t}},{{a}_{t}},{{r}_{t}},{{s}_{t+1}})\) are put into an experience memory pool, and when training starts, a mini-batch of transitions \(({{{s}'}_{t}},{{{a}'}_{t}},{{{r}'}_{t}},{{{s}'}_{t+1}})\) is randomly sampled from the pool. As a result, the experience replay mechanism not only lets the agent learn from previous experiences repeatedly but also removes the correlations between the observations. Thereby, the DDQN network training becomes more stable and more efficient. The whole procedure of the proposed DDQN algorithm is illustrated in Fig. 3, and the proposed DDQN algorithm-based offloading scheme is presented in Algorithm 2.
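The experience pool and the periodic parameter copy can be sketched in plain Python. This is a minimal illustration rather than Algorithm 2: the class `ReplayBuffer`, the helper `maybe_sync`, and the representation of network parameters as a dictionary are all assumptions made for the example.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size pool of (s_t, a_t, r_t, s_{t+1}) transitions."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest entries are dropped
        self._rng = random.Random(seed)

    def push(self, s, a, r, s_next):
        self._buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)

def maybe_sync(step, interval, online_params, target_params):
    """Copy the predict-network parameters into the target every `interval` steps."""
    if step % interval == 0:
        target_params.clear()
        target_params.update(copy.deepcopy(online_params))
        return True
    return False
```

The deep copy keeps the target parameters fixed between synchronizations even while the predict network continues to be updated.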

### 4.2 DQN algorithm-based computation resource allocation scheme

Since multiple UEs connected to the same FAP compete for resources, some of the tasks in the FAP should be relayed to the cloud server to maximize the total utility. Meanwhile, the computational resource of each FAP should be allocated to the UEs whose tasks have been offloaded to that FAP. In this part, we first classify the tasks in each FAP into two groups according to each UE’s latency requirement, which is characterized by the delay revenue coefficient \(\rho _{n}^{t}\). Specifically, tasks with a stricter delay requirement, represented by \(\rho _{n}^{t}\ge 0.5\), remain at the FAP for processing, whereas tasks with \(\rho _{n}^{t}<0.5\) are sent to the cloud for processing. Since the cloud center has abundant computational resources and powerful processing capability, while the computational resource of FAPs is limited, we assume that the tasks sent to the cloud center can be processed in parallel [34]. Meanwhile, a distributed DQN algorithm is adopted to optimize the resource allocation in each FAP.
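The threshold rule above amounts to a simple partition of the UEs attached to one FAP. In the sketch below, the function name `split_tasks` and the dictionary mapping UE indices to their coefficients \(\rho _{n}^{t}\) are hypothetical conveniences.

```python
def split_tasks(delay_coeffs, threshold=0.5):
    """Partition UE indices by delay revenue coefficient.

    UEs with rho >= threshold stay at the FAP; the others go to the cloud.
    """
    at_fap = [n for n, rho in delay_coeffs.items() if rho >= threshold]
    to_cloud = [n for n, rho in delay_coeffs.items() if rho < threshold]
    return at_fap, to_cloud
```

For example, coefficients `{1: 0.7, 2: 0.5, 3: 0.2}` keep UEs 1 and 2 at the FAP and send UE 3 to the cloud.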

The DQN algorithm is also a typical model-free DQL method [35], so the computational resource allocation problem can be formulated as an MDP as well. The Agent, State, Action, and Reward are described as follows.

**Agent:** In the proposed distributed DQN algorithm, since the objective is to optimize the computational resource in each FAP, we define each FAP as an agent.

**Environment & State:** The state is defined as a combination of the available resources in each FAP and the utility obtained by the UEs in each FAP, which can be expressed as \(s=({{F}_{m}},\sum _{i=1}^{{{N}_{m}}}{{{U}_{i}}})\), where \({{N}_{m}}\) stands for the number of UEs that offload their tasks to FAP *m*.

**Action Space:** The action should cover all possible schemes for allocating resources to the UEs that remain at FAP *m*. Because the DQN algorithm is mainly suited to problems with discrete actions, the computational resource in each FAP should also be discretized, and discrete computational resource blocks are allocated to each UE. Suppose the computational resource in FAP *m* is divided equally into *X* parts. The action is then expressed as \({{a}_{t}}=({{f}_{1}},{{f}_{2}},...,{{f}_{i}},...,{{f}_{{{N}_{m}}}}),{{f}_{i}}\in \{1,2,3,...,X\}\), where \({{f}_{i}}\) denotes the number of computational resource blocks allocated to UE *i*.
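To make the size of this action space concrete, the sketch below enumerates the allocation vectors for a small FAP. The function name is hypothetical, and the optional budget check (the allocated blocks must not exceed *X* in total) is an assumption that the text does not state explicitly.

```python
from itertools import product

def enumerate_actions(num_ues, num_blocks, enforce_budget=True):
    """List allocation vectors (f_1, ..., f_Nm) with each f_i in {1, ..., X}.

    Assumption: when `enforce_budget` is True, only vectors whose total
    does not exceed the X available blocks are kept.
    """
    actions = product(range(1, num_blocks + 1), repeat=num_ues)
    if enforce_budget:
        return [a for a in actions if sum(a) <= num_blocks]
    return list(actions)
```

Without the budget check the number of actions is \({{X}^{{{N}_{m}}}}\), which grows quickly with the number of UEs per FAP and is one reason a coarse discretization may be preferable.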

**Immediate Reward:** Since each FAP acts as an agent in this distributed DQN-based resource allocation problem, FAP *m* immediately obtains a positive reward equal to the total utility of the UEs in FAP *m*, which is expressed as \(\sum _{i}^{{{N}_{m}}}{{{U}_{i}}}\). In practice, if the variation of the reward value stays within a small threshold for ten consecutive time steps in a training epoch, the epoch is terminated and the network starts the next training epoch.
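The termination rule can be sketched as a sliding-window check on recent rewards. The class name `RewardPlateau` and the default tolerance are hypothetical, and measuring the variation as the difference between the largest and smallest reward in the window is one possible reading of the criterion.

```python
from collections import deque

class RewardPlateau:
    """Signal the end of an epoch when recent rewards stop changing."""

    def __init__(self, window=10, tolerance=1e-3):
        self._recent = deque(maxlen=window)  # keeps the last `window` rewards
        self._tolerance = tolerance          # assumed small threshold

    def update(self, reward) -> bool:
        """Record a reward; return True if the epoch should terminate."""
        self._recent.append(reward)
        if len(self._recent) < self._recent.maxlen:
            return False  # not enough consecutive steps observed yet
        return max(self._recent) - min(self._recent) <= self._tolerance
```

A new `RewardPlateau` instance would be created at the start of each training epoch so that rewards from the previous epoch do not affect the check.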

As shown in Fig. 4, the input of the DQN is the state \({{s}_{t}}\) in each step *t*. Three fully connected layers are then used to extract the features of the input data, and the output of the DQN is the resource allocation vector. When the DQN algorithm converges, the agent eventually learns the optimal resource allocation vector \(({{f}_{1}}^{*},{{f}_{2}}^{*},...,{{f}_{i}}^{*},...,{{f}_{{{N}_{m}}}}^{*})\).

Similarly, the DQN algorithm uses the gradient descent algorithm to update the *Q*-network during each training epoch to minimize the loss function, which is formulated as

$$\begin{aligned} {{L}_{t}}(\theta )=E\left[ {{\left( \underbrace{\left( {{r}_{t}}+\gamma \underset{{{a}_{t+1}}}{\mathop {\max }}\,Q\left( {{s}_{t+1}},{{a}_{t+1}},{\theta }' \right) \right) }_{\text{Target}}-Q\left( {{s}_{t}},{{a}_{t}},\theta \right) \right) }^{2}} \right] \end{aligned}$$

(28)

where \({\theta }'\) represents the parameters of the target network, which are copied from the predict network parameters \(\theta\) every few steps. As with the DDQN algorithm, the DQN algorithm also adopts the experience replay mechanism to remove the correlation in the data and make the training of the network more stable. The proposed DQN algorithm-based computational resource allocation is illustrated in Algorithm 3.
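For a sampled mini-batch, the loss in Eq. (28) can be evaluated as in the sketch below. The function name and the array layout (one row per transition, one column per action) are assumptions for illustration, and the expectation is approximated by the batch mean. Terminal transitions are not treated separately here.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, gamma):
    """Mean squared TD error over a mini-batch, cf. Eq. (28).

    q_online:      shape (B, |A|), Q(s_t, ., theta) from the predict network
    q_target_next: shape (B, |A|), Q(s_{t+1}, ., theta') from the target network
    actions:       shape (B,), indices of the actions actually taken
    rewards:       shape (B,), immediate rewards r_t
    """
    targets = rewards + gamma * q_target_next.max(axis=1)
    predicted = q_online[np.arange(len(actions)), actions]
    return float(np.mean((targets - predicted) ** 2))
```

In a training loop, the gradient of this quantity would be taken with respect to \(\theta\) only, with \({\theta }'\) held fixed until the next parameter copy.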