2.1 System model
This paper studies resource allocation in an uplink multiuser NOMA system, where the base station (BS) is located at the center of the cell and the users are randomly distributed around it. The total system bandwidth B is divided equally among S subchannels, and users sharing a subchannel are nonorthogonal. Assume that there are U users and S subchannels in the system, and that the maximum power transmitted by the base station is P_{max}. The signal transmitted on subchannel s is,
$$x_{s} (t) = \sum\limits_{u = 1}^{U} {b_{s,u} {(}t{)}\sqrt {p_{s,u} (t)} } x_{s,u} (t)$$
(1)
where \(x_{s,u} (t)\) and \(p_{s,u} (t)\) represent the data signal and allocated power of user u on subchannel s, respectively. \(b_{s,u}(t) = 1\) indicates that subchannel s is allocated to user u, and \(b_{s,u}(t) = 0\) indicates that it is not. The received signal can be expressed as,
$$y_{s,u} (t) = b_{s,u} (t) h_{s,u} (t)\sum\limits_{u = 1}^{U} {\sqrt {p_{s,u} (t)} } x_{s,u} (t) + \sum\limits_{q = 1,q \ne u}^{U} {b_{s,q} (t)\sqrt {p_{s,q} (t)} } x_{s,q} (t) + z_{s,u} (t)$$
(2)
where \(h_{s,u} (t) = g_{s,u} PL^{-1}(d_{s,u})\) denotes the channel gain between the base station and user u on subchannel s. Assume that \(g_{s,u}\) is the Rayleigh fading channel gain [26], \(PL^{-1}(d_{s,u})\) is the path loss, and d_{s,u} is the distance between user u and the base station on subchannel s. \(z_{s,u} (t)\) represents additive white Gaussian noise, which follows the complex Gaussian distribution, i.e. \(z_{s,u} (t)\sim {\text{CN}}(0,\sigma_{n}^{2})\).
In the NOMA system, the superimposed users interfere with one another, so successive interference cancellation (SIC) is required at the receiver. The receiver first decodes the user with the highest power level and subtracts its signal from the mixed signal. It repeats this process until the desired signal has the maximum power in the remaining superimposed signal, and treats the rest as interference. As a result, the signal to interference plus noise ratio (SINR) can be described as,
$${\text{SINR}}(t) = \frac{{b_{s,u} (t)p_{s,u} (t)\left| {h_{s,u} (t)} \right|^{2} }}{{\sum\nolimits_{{q = 1,\left| {h_{s,q} (t)} \right|^{2} < \left| {h_{s,u} (t)} \right|^{2} }}^{U} {b_{s,q} (t)p_{s,q} (t)\left| {h_{s,q} (t)} \right|^{2} } + \sigma_{n}^{2} }}$$
(3)
The data rate of user u on subchannel s is defined as,
$$R_{s,u} {(}t{)} = \frac{B}{S}{\text{log}}_{2} {(}1 + {\text{SINR(}}t{))}$$
(4)
The user sum rate is,
$$R = \sum\limits_{s = 1}^{S} {\sum\limits_{u = 1}^{U} {R_{s,u} (t)} }$$
(5)
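As a rough illustration of (3)-(5), the following Python sketch computes the SINR, the rate, and the sum rate for the users of a single subchannel. The function names and all numeric values (powers, channel gains, noise power, bandwidth) are illustrative assumptions and are not taken from this paper.

```python
import math

def sinr(u, b, p, h2, noise):
    """SINR of user u on one subchannel, following Eq. (3).

    b[q]  : 1 if user q is assigned to the subchannel, else 0
    p[q]  : power allocated to user q
    h2[q] : squared channel-gain magnitude |h_q|^2
    Only users with a weaker channel than u remain as interference.
    """
    interference = sum(b[q] * p[q] * h2[q]
                       for q in range(len(b)) if h2[q] < h2[u])
    return b[u] * p[u] * h2[u] / (interference + noise)

def rate(u, b, p, h2, noise, bandwidth, n_subchannels):
    """Data rate of user u, following Eq. (4)."""
    return bandwidth / n_subchannels * math.log2(1.0 + sinr(u, b, p, h2, noise))

# Hypothetical example: three users share one subchannel (made-up values).
b = [1, 1, 1]
p = [0.5, 0.3, 0.2]
h2 = [4.0, 1.0, 0.25]
noise = 0.1
rates = [rate(u, b, p, h2, noise, bandwidth=1.0, n_subchannels=1) for u in range(3)]
sum_rate = sum(rates)  # sum over the users of this subchannel
```

In this sketch the user with the weakest channel sees no residual interference, while the strongest user sees the signals of both weaker users.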
The optimization objectives and constraints of the joint user grouping and power allocation problem are given as follows,
$$\begin{aligned} & \text{P}1: \text{max} R\\ & \text{C}1: 0 \le \sum\limits_{s = 1}^{S} {p_{s,u} (t)} \le P_{\max } ,\;s \in S,u \in U\\ & \text{C}2: b_{s,u} (t) \in \{0,1\},\;s \in S,u \in U\\ & \text{C}3: \sum\limits_{s = 1}^{\text{S}} {b_{s,u} (t)} \le 1,\;s \in S,u \in U\\ & \text{C}4: \sum\limits_{u = 1}^{U} {b_{s,u} (t)} \le C,\;s \in S,u \in U \\ \end{aligned}$$
(6)
In the above constraints, C1 indicates that the total power allocated to each user should not exceed the maximum power P_{max}. C2 indicates that the subchannel allocation indicator is binary. C3 indicates that each user is assigned to at most one subchannel, and C4 indicates that at most C users can be multiplexed on one subchannel. Because P1 is a non-convex optimization problem, it is difficult to find the global optimal solution. Although a global (exhaustive) search can find the optimal solution by enumerating all grouping possibilities, its computational complexity is too high for practical use. Therefore, a DRL-based method is proposed for user grouping and power allocation in the NOMA system.
2.2 NOMA resource allocation based on DRL network
In this section, a NOMA resource allocation network based on DRL network is proposed. The description of the system structure is given in the following subsections.
2.2.1 System structure
The system structure is shown in Fig. 1. Figure 1a is a general reinforcement learning network structure. The general reinforcement learning is mainly divided into the following five parts: agent, environment, state s_{t}, action a_{t}, and immediate reward r_{t}. The learning process of reinforcement learning can be described as follows: the agent obtains the state s_{t} from the environment, and then selects an action a_{t} from the action space and feeds it back to the environment. At this time, the environment generates a reward r_{t}, which is generated by choosing this action a_{t} in the current state s_{t}, and also generates state s_{t+1} of the next time slot. Then the environment gives them back to the agent. The agent stores learning experience in the experience replay pool to facilitate learning in the next time slot.
According to the structure of reinforcement learning, the system model designed in this paper is shown in Fig. 1b. Specifically, the NOMA system represents the environment of reinforcement learning. There are two agents: the Prioritized DQN user grouping network is agent 1, and the DDPG power allocation network is agent 2. We use the channel gain to characterize the environment. Accordingly, the state space can be expressed as \(S = \{h_{1,1} (t),\;h_{2,1} (t), \ldots ,h_{s,u} (t)\}\), the user grouping space as \(A1 = \{ b_{1,1} (t),b_{2,1} (t), \ldots ,b_{s,u} (t)\}\), and the power allocation space as \(A2 = \{ p_{1,1} (t),\;p_{2,1} (t), \ldots ,p_{s,u} (t)\}\). The immediate reward is denoted as \(r_{t} = R\), where R is the system sum rate defined in (5). Our goal is to maximize the long-term reward, which is expressed as,
$$R_{t} = r_{t} + \gamma r_{t + 1} + \gamma^{2} r_{t + 2} + \cdots {\kern 1pt} = \sum\limits_{i = 0}^{\infty } {\gamma^{i} } r_{t + i} \;,\;\;\;\gamma \in {[}0,1{]}$$
(7)
where γ is the discount factor. When \(\gamma = 0\), the agent only pays attention to the reward generated in the current state; when \(\gamma \ne 0\), the agent also pays attention to future rewards, which carry more weight as γ increases.
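A minimal sketch of (7), truncated to a finite number of time slots, may make the role of γ concrete. The function name and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Finite-horizon version of Eq. (7): sum of gamma**i * rewards[i]."""
    total = 0.0
    for r in reversed(rewards):      # accumulate from the last slot backwards
        total = r + gamma * total
    return total

rewards = [1.0, 2.0, 3.0]            # hypothetical per-slot sum rates
myopic = discounted_return(rewards, 0.0)      # only the immediate reward counts
farsighted = discounted_return(rewards, 0.5)  # 1 + 0.5*2 + 0.25*3
```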
The expected value of cumulative return R_{t} (obtained by (7)) of general reinforcement learning is defined as the Q value, which is determined by state s_{t}, and the selection of action a_{t} under a certain strategy \(\pi\). It is expressed as,
$$Q_{\pi } (s_{t} ,a_{t} ) = {\text{E}}[r_{t} + \gamma \max_{a_{t + 1}} Q_{\pi } (s_{t + 1} ,a_{t + 1} ) \mid s_{t} ,a_{t} ]$$
(8)
In summary, in each time slot (TS), the agent obtains the channel gain from the NOMA system, selects a user combination and power from the action space according to the current channel gain, and returns the resulting action (user groups and power) to the NOMA system. According to the received action, the NOMA system generates the immediate reward and the channel gain of the next time slot, and then passes them to the agent. Based on the reward, the agent updates the decision function for selecting this action under the current channel gain, which completes one interaction. This process is repeated until the agent can generate an optimal decision under any channel gain. The specific design of the Prioritized DQN and DDPG networks in Fig. 1b is illustrated in Fig. 2, and a detailed description of them is given in the following subsections.
2.2.2 User grouping based on prioritized DQN
In this article, we use the Prioritized DQN, an improved version of DQN, to perform user grouping. The DQN includes two networks: the Q network generates the estimated Q value, and the target Q network generates the target Q values used to train the Q network parameters. The two networks are identical in structure but differ in parameters. The Q network is trained with the latest parameters, while the parameters of the target Q network are copied from the Q network at intervals. The main idea of the DQN algorithm is to continuously adjust the network weights by minimizing the loss function formed from the estimated Q value and the target Q value. Moreover, experience replay is used in the DQN to reduce the correlation between samples. In DQN, all samples are drawn uniformly from the experience replay pool. In this case, some important samples may be neglected, which reduces the learning efficiency. To overcome this shortcoming of uniform sampling from the experience pool, a reinforcement learning method based on prioritized experience replay was proposed, which mainly addresses the sampling problem in experience replay [27]. The main idea is to set priorities for different samples to increase the sampling probability of valuable samples. To make the algorithm easier to follow, we first introduce prioritized experience replay.

1. Prioritized experience replay
The temporal-difference error (TD-error) indicates the difference between the target value and the estimated value of the selected action. The TD-errors produced by different samples differ, and so do their effects on backpropagation. A sample with a large TD-error indicates a large gap between the current value and the target value, which means that the sample needs to be learned and trained. Therefore, to measure the importance of a sample, we use the TD-error to represent its priority, which can be expressed as,
$$\delta_{i} = \left| {y_{i} - Q(s_{i} ,a_{i} ;\omega )} \right| + \psi$$
(9)
where δ_{i} is the TD-error of sample i, \(\psi\) is a very small constant that ensures that samples with a priority of 0 can still be selected, and y_{i} is the target value defined in (14).
By setting priorities for samples, samples with large TD-errors are sampled with high probability and join the learning process more frequently. In contrast, samples with small TD-errors may never be replayed, because their TD-errors are only updated when they are sampled and therefore remain small. In this case, the diversity of samples is lost, which results in overfitting. It is therefore necessary to ensure that samples with low priority can still be sampled with a certain probability. To this end, the sampling probability is defined as [23],
$$P(i) = \frac{{\delta_{i}^{\alpha } }}{{\sum\nolimits_{k} {\delta_{k}^{\alpha } } }}$$
(10)
where α determines the degree of prioritization. The range of α is [0,1]: α = 0 corresponds to uniform sampling, and α = 1 corresponds to greedy sampling, in which the probability is directly proportional to the priority. The exponent α does not change the monotonicity of the priorities; it only increases or decreases the influence of the TD-error on the sampling probability.
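The two definitions (9) and (10) can be sketched in a few lines of Python. The function names, the default value of \(\psi\), and the example priorities are assumptions made for illustration only.

```python
def priorities(targets, estimates, psi=1e-3):
    """Eq. (9): |y_i - Q(s_i, a_i)| + psi for every stored sample."""
    return [abs(y - q) + psi for y, q in zip(targets, estimates)]

def sampling_probabilities(deltas, alpha):
    """Eq. (10): P(i) = delta_i**alpha / sum_k delta_k**alpha."""
    scaled = [d ** alpha for d in deltas]
    total = sum(scaled)
    return [s / total for s in scaled]

deltas = [1.0, 2.0, 5.0]                             # hypothetical priorities
uniform = sampling_probabilities(deltas, alpha=0.0)  # every sample equally likely
greedy = sampling_probabilities(deltas, alpha=1.0)   # proportional to priority
```

With α = 0 all three samples receive probability 1/3, while with α = 1 the probabilities follow the priorities directly.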

A. Sum tree
Prioritized DQN uses a sum tree to avoid sorting the samples before sampling. The sum tree is a binary tree whose structure is shown in Fig. 3a. The top of the sum tree is the root node, every non-leaf node has exactly two child nodes, the bottom layer consists of the leaf nodes, and the remaining nodes are internal nodes. The numbers in the figure are the indexes of the nodes, starting from the root node 0. We also use an array T to store the corresponding sample tuples, as shown in Table 1, where idx represents the index of the sample tuple in the array T.

B. Storing data
Denote the layer index as l and the number of layers of the binary tree as L. The number of nodes in layer l is \(2^{l - 1}\;(l = 1,2, \ldots ,L)\), and the total number of nodes in the binary tree is 2^{L}−1. It can be seen in Fig. 3a that the number of the leftmost node in layer l is \(2^{l - 1} - 1\;(l = 1,2, \ldots ,L)\), so the leftmost leaf node is numbered 2^{L−1}−1. The index of array T corresponding to the leftmost leaf node is 0, which is denoted as idx = 0. Each time a priority is stored, the leaf node number and idx are both increased by 1. The sum tree only stores the priorities of samples at the leaf nodes, i.e., the nodes in the L^{th} layer, and the priority of each leaf node is matched to a sample tuple in array T. In addition, the priority of an internal node is the sum of the priorities of its child nodes, and the priority of the root node is the sum of the priorities of all the leaf nodes. The higher the value of a leaf node, the higher the priority of the sample. The priorities of the samples are stored in the leaf nodes from left to right. The storage steps are given as follows:

1. Number the 2^{L}−1 nodes of the sum tree, and initialize the priorities of all the leaf nodes of the sum tree to 0;

2. Store the priority of the current sample in the leftmost leaf node, and store the current sample tuple in array T at index idx = 0. At the same time, update the priorities of its parent nodes upward through the binary tree;

3. Add the priority of the next sample at the second leaf node of the sum tree. The number of this leaf node is 2^{L−1} (obtained by (2^{L−1}−1) + 1), and the corresponding index of array T is 1 (obtained by 0 + 1). Then, add the sample tuple to array T at index 1, and update the priorities of the parent nodes upward through the binary tree;

4. Following the storage method above, add the priorities of the samples to the leaf nodes one by one. When all the leaf nodes are filled, the subsequent priority is stored in the first leaf node again.
The difference between the leaf node number of the sum tree and the index of the corresponding T is 2^{L−1}–1. The binary tree after storing the priority is shown in Fig. 3b. The leaf node numbered 7 has the highest priority, and this indicates that this node has the largest probability of being sampled.
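The storage steps above can be sketched with an array-backed tree. This is a minimal illustration rather than the implementation used in this paper: the class name `SumTree`, its method names, and the integer priorities in the usage lines are all assumptions.

```python
class SumTree:
    """Array-backed sum tree with `capacity` leaves (capacity = 2**(L-1))."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)   # 2**L - 1 nodes, priorities start at 0
        self.data = [None] * capacity            # array T of sample tuples
        self.idx = 0                             # next write position in T

    def update(self, leaf, priority):
        """Set a leaf priority and propagate the change up to the root."""
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:
            leaf = (leaf - 1) // 2               # parent of the current node
            self.tree[leaf] += change

    def add(self, priority, sample):
        """Store a sample in T[idx] and its priority in the matching leaf."""
        leaf = self.idx + self.capacity - 1      # leaf number = idx + 2**(L-1) - 1
        self.data[self.idx] = sample
        self.update(leaf, priority)
        self.idx = (self.idx + 1) % self.capacity  # wrap to the first leaf when full
        return leaf

tree = SumTree(capacity=8)                       # L = 4 layers, 15 nodes
first = tree.add(3, "sample-0")                  # stored at leaf 7, idx 0
second = tree.add(2, "sample-1")                 # stored at leaf 8, idx 1
```

Because only the path from the modified leaf to the root is touched, each insertion or priority update costs a number of steps proportional to the number of layers L.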

C. Sampling data
Denote the number of samples to be extracted as N, and the priority of the root node as P. Dividing P by N gives the quotient M. Hence the total priority is divided into N intervals, and the jth interval is [(j−1)·M, j·M]. For example, if the priority of the root node is 1.12 and the number of samples is 8, the priority intervals are [0,0.14], [0.14,0.28], [0.28,0.42], [0.42,0.56], [0.56,0.70], [0.70,0.84], [0.84,0.98], and [0.98,1.12]. A value is sampled uniformly in each interval; suppose that 0.60 is drawn from the interval [0.56,0.70]. Start traversing from the root node and compare 0.60 with its left child node 0.69. Since the left child node 0.69 is larger than 0.60, take the path of the left child node and traverse its child nodes. Then compare 0.60 with the left child node 0.50 of the node 0.69. Since 0.60 is larger than 0.50, subtract 0.50 from 0.60 to obtain 0.10, enter the right child node 0.19, and traverse its child nodes. Compare 0.10 with the left child node 0.01 of the node 0.19. Because 0.10 is larger than 0.01, take the path of the right child node. Finally, the priority of the selected sample is 0.18, and the leaf node number in the sum tree is 10. The sample corresponding to this leaf node is then extracted from array T. The same procedure is applied to a value drawn uniformly from each of the other intervals, so that 8 samples are obtained in total.
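The traversal in this worked example can be reproduced with the short sketch below. Priorities are expressed in hundredths so that the arithmetic is exact. Only the values quoted in the text (the leaves 0.01 and 0.18 and the node sums 0.19, 0.50, 0.69 and 1.12) come from the example; the remaining six leaf values and all function names are hypothetical.

```python
import random

def build_tree(leaves):
    """Build the flat node array of a sum tree from its leaf priorities."""
    n = len(leaves)                      # assumed to be a power of two
    tree = [0] * (n - 1) + list(leaves)
    for node in range(n - 2, -1, -1):    # fill internal nodes bottom-up
        tree[node] = tree[2 * node + 1] + tree[2 * node + 2]
    return tree

def retrieve(tree, value):
    """Walk from the root to the leaf whose priority interval contains value."""
    node = 0
    while 2 * node + 1 < len(tree):      # stop once the node has no children
        left = 2 * node + 1
        if value <= tree[left]:
            node = left                  # stay in the left subtree
        else:
            value -= tree[left]          # skip the priority mass of the left subtree
            node = left + 1
    return node

def stratified_sample(tree, n, rng=random):
    """Draw one value uniformly from each of n equal intervals of the total."""
    width = tree[0] / n
    return [retrieve(tree, rng.uniform(j * width, (j + 1) * width))
            for j in range(n)]

# Leaves 7..14 in hundredths; all but 1 and 18 are hypothetical fillers chosen
# to match the node sums quoted in the text.
tree = build_tree([30, 20, 1, 18, 13, 10, 12, 8])
leaf = retrieve(tree, 60)                # the value 0.60 from the worked example
batch = stratified_sample(tree, 8)       # one leaf number per interval
```

Here `retrieve(tree, 60)` follows the same left, right, right path as the text and ends at leaf node 10.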

D. Importance sampling
The distribution of the samples used to train the network should be the same as the original distribution. However, since experience samples with high TD-errors are replayed more frequently, the sample distribution is changed. This change biases the estimated value, because experience samples with high priority are used to train the network more often. Importance sampling corrects the introduced bias by reducing the weight of such samples when the network model is updated [28]. The weight of importance sampling is,
$$w_{i} = \left( {\frac{1}{N} \cdot \frac{1}{P(i)}} \right)^{\beta }$$
(11)
where N is the number of samples, P(i) is the sampling probability calculated according to (10), and \(\beta\) adjusts the degree of bias correction. The slight bias can be ignored at the beginning of learning, and the correction applied by importance sampling grows over time: \(\beta\) increases linearly from its initial value and converges to 1 at the end of training. When \(\beta = 1\), the bias is completely eliminated.
Figure 4 shows the relationship between \(\beta\) and the number of iterations (the initial value is 0.4). It can be seen from the figure that at the end of the iteration, \(\beta\) can converge to 1, which means that the nonuniform probability is completely compensated, and the deviation caused by prioritized experience replay can be corrected.
In order to ensure the stability of learning, we always normalize weights, so (11) can be rewritten as,
$$\begin{gathered} w_{i} = \frac{{(N \cdot P(i))^{ - \beta } }}{{\max_{j} w_{j} }} = \frac{{(N \cdot P(i))^{ - \beta } }}{{\max_{j} [(N \cdot P(j))^{ - \beta } ]}} \hfill \\ \;\;\;\; = \frac{{(P(i))^{ - \beta } }}{{\max_{j} [(\frac{1}{P(j)})^{\beta } ]}} = \left( {\frac{P(i)}{{\min_{j} P(j)}}} \right)^{ - \beta } \hfill \\ \end{gathered}$$
(12)
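A short sketch of the normalized weights in (12), together with a linear schedule for \(\beta\), is given below. The function names, the example probabilities, and the schedule length are assumptions; only the initial value 0.4 is taken from the text.

```python
def beta_schedule(step, total_steps, beta0=0.4):
    """Linearly anneal beta from beta0 to 1 over total_steps."""
    frac = min(1.0, step / total_steps)
    return beta0 + frac * (1.0 - beta0)

def importance_weights(probs, beta):
    """Eq. (12): w_i = (P(i) / min_j P(j)) ** (-beta), so the largest weight is 1."""
    p_min = min(probs)
    return [(p / p_min) ** (-beta) for p in probs]

probs = [0.125, 0.25, 0.625]          # hypothetical sampling probabilities
w = importance_weights(probs, beta=1.0)
```

The sample with the smallest probability keeps weight 1, and more frequently replayed samples are scaled down.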

2. Prioritized-DQN-based user grouping network
In this section, we introduce the user grouping framework based on Prioritized DQN. As shown in Fig. 2, the user grouping part contains prioritized experience replay. Prioritized DQN contains two subnetworks, a Q Network is used to generate the estimated Q value of the selected action, and a Target Q Network to generate the target Q value for training the neural network.
In our NOMA system, at the beginning of each TS t, the base station receives channel state information \(s_{t}\) and inputs it into the Q Network of the Prioritized DQN. With \(s_{t}\) as input, the Q Network outputs the estimated Q value \(Q(s_{t} ,a_{t}^{1} ;\omega )\) of every user combination \(a_{t}^{1}\) \((a_{t}^{1} \in A1)\). In this paper, the \(\zeta\)-greedy strategy is used to select the user combination \(a_{t}^{1}\): it randomly selects a user combination from A1 with probability \(\zeta\), or the user combination with the highest estimated Q value with probability \((1 - \zeta )\). That is,
$$a_{t}^{1} = \arg \mathop {\max }\limits_{{a_{t}^{1} \in A1}} Q(s_{t} ,a_{t}^{1} ;\omega )$$
(13)
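The selection rule can be sketched as follows. The function name and the Q values are hypothetical, and each action index stands for one user combination in A1.

```python
import random

def zeta_greedy(q_values, zeta, rng=random):
    """Pick a random action with probability zeta, else the arg-max action."""
    if rng.random() < zeta:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit, Eq. (13)

q_values = [0.2, 1.5, 0.7]     # hypothetical Q(s_t, a; w) for three user combinations
action = zeta_greedy(q_values, zeta=0.0)   # zeta = 0 always exploits
```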
Finally, the user combination \(a_{t}^{1}\) and power \(a_{t}^{2}\) (produced by the next section) are given back to the NOMA system. According to the selected actions, the NOMA system generates instant rewards \(r_{t}\) and channel state information \(s_{t + 1}\) of the next time slot. We store the sample tuple \((s_{t} ,a_{t}^{1} ,r_{t} ,s_{t + 1} )\) of each TS into the memory block.
In each TS, to ensure that all samples can be sampled, Prioritized DQN assigns new samples the highest priority, and stores the sample tuples and priorities in the experience pool following the storage steps in subsection 1B of Sect. 2.2.2 above. The sample tuples are selected according to the sampling method in subsection 1C of Sect. 2.2.2. As mentioned above, we use the sampling probability to calculate the sample weight (i.e., (12)), and use the target Q network to generate the target Q value for training the network, which is,
$$y_{i} = r_{i} + \gamma \mathop {\max }\limits_{{a_{i + 1}^{1} \in A1}} Q(s_{i + 1} ,a_{i + 1}^{1} ,\omega ^{\prime})$$
(14)
The loss function of Prioritized DQN can be expressed as,
$$loss = \frac{1}{N}\sum\limits_{i = 1}^{N} {w_{i} (y_{i} - Q(s_{i} ,a_{i}^{1} ;\omega ))^{2} }$$
(15)
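As a numerical illustration of (14) and (15), the sketch below evaluates the targets and the weighted loss for a minibatch of two samples. It works on plain lists instead of neural-network outputs, and the function names and all values are assumptions.

```python
def td_targets(rewards, next_q_values, gamma):
    """Eq. (14): y_i = r_i + gamma * max_a Q_target(s_{i+1}, a)."""
    return [r + gamma * max(q_next) for r, q_next in zip(rewards, next_q_values)]

def weighted_loss(weights, targets, estimates):
    """Eq. (15): mean of w_i * (y_i - Q(s_i, a_i))**2 over the minibatch."""
    n = len(targets)
    return sum(w * (y - q) ** 2
               for w, y, q in zip(weights, targets, estimates)) / n

rewards = [1.0, 0.0]                        # hypothetical minibatch of two samples
next_q = [[0.0, 2.0], [4.0, 1.0]]           # target-network outputs per next state
y = td_targets(rewards, next_q, gamma=0.5)
loss = weighted_loss([1.0, 0.5], y, [1.0, 4.0])
```

In a full implementation the gradient of this loss with respect to \(\omega\) would be obtained by automatic differentiation in a deep learning framework.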
The weights \(\omega\) of the Q network in the Prioritized DQN are updated through gradient backpropagation, and the parameters of the target Q network are updated every W TSs by copying the parameters of the Q network, i.e. \(\omega ^{\prime} = \omega\).
After the parameters of the Q network of the Prioritized DQN are updated, the TD-error (i.e., (9)) of all the selected samples must be recalculated. For each sample, find the corresponding leaf node according to the leaf node number obtained during sampling, and set the priority of the sample to its new TD-error. Then follow the same method as for storing data to update the priority of the sum tree leaf node and the priorities of all its parent nodes.
2.2.3 Power allocation based on DDPG network
Since the output of DQN is discrete, it cannot be applied to a continuous action space. Fortunately, an Actor-Critic-based DDPG network can handle continuous actions. Wang et al. [29] proposed two frameworks (i.e., DDRA and CDRA) to maximize the energy efficiency of the NOMA system, where DDRA is based on the DDPG network and CDRA is based on multi-DQN. The results show that the time complexities of the two frameworks are similar, but the DDPG network performs better than the multi-DQN network. This is because in multi-DQN the user power is quantized, which loses some important information and degrades performance. The DDPG network is similar to DQN in that it uses deep neural networks and uniform sampling. It is also a deterministic policy gradient network, in which the action is uniquely determined by the state. Moreover, DDPG can handle continuous action tasks without quantizing the transmission power. Hence this paper uses the DDPG network to perform the power allocation task for the users.