NOMA resource allocation method in IoV based on prioritized DQN-DDPG network
EURASIP Journal on Advances in Signal Processing, volume 2021, Article number: 120 (2021)
Abstract
To meet the demands of massive connections in Internet-of-vehicles communications, non-orthogonal multiple access (NOMA) is utilized in local wireless networks. In NOMA, various optimization methods have been proposed to provide optimal resource allocation, but they are limited by computational complexity. Recently, deep reinforcement learning networks have been utilized for resource optimization in NOMA systems, where a uniformly sampled experience replay algorithm is used to reduce the correlation between samples. However, uniform sampling ignores the importance of samples. To this end, this paper proposes a joint prioritized-DQN user grouping and DDPG power allocation algorithm to maximize the system sum rate. At the user grouping stage, a prioritized sampling method based on TD-error (temporal-difference error) is proposed. At the power allocation stage, to deal with the problem that DQN cannot process continuous tasks and needs to quantize power into discrete form, a DDPG network is utilized. Simulation results show that the proposed algorithm with prioritized sampling can increase the learning rate and yield a more stable training process. Compared with the previous DQN algorithm, the proposed method improves the sum rate of the system by 2% and reaches 94% and 93% of the exhaustive search algorithm and the optimal iterative power optimization algorithm, respectively. Although the sum rate is improved by only 2%, the computational complexity is reduced by 43% and 64% compared with the exhaustive search algorithm and the optimal iterative power optimization algorithm, respectively.
1 Introduction
The Internet of Vehicles (IoV) supports road safety, smart and green transportation, and in-vehicle Internet access, and is a promising technique for improving autonomous driving performance. 5G is the core wireless technology for IoV networks, providing ubiquitous connectivity and mass data transmission [1]. Among the new technologies in 5G, non-orthogonal multiple access (NOMA) is utilized to support high-capacity data transmission by multiplexing the same time–frequency resources through power-domain or code-domain division [2,3,4]. Sparse code multiple access (SCMA) is a popular code-domain NOMA technology. Its spreading sequences are sparse, and SCMA can significantly improve system capacity through non-orthogonal resource allocation [5]. The principle of power-domain NOMA is to allocate different power levels to different users at the transmitter, superimpose multiple users on the same time–frequency resource block by superposition coding (SC), and send the superimposed signal to the receiver non-orthogonally. At the receiving end, successive interference cancellation (SIC) is used to eliminate interference from the superimposed users [6].
Meanwhile, the connection ability of NOMA makes it applicable to future wireless communication systems (for example, cooperative communication, multiple-input multiple-output (MIMO), beamforming, and the Internet of Things (IoT)). Researchers have combined NOMA and MIMO to exploit their respective advantages, which can further improve system capacity and reliability [7, 8]. Liu et al. [9] proposed a Ka-band multi-beam satellite IIoT, which improved the transmission rate of NOMA by optimizing the power allocation proportion of each node. The results showed that the total transmission rate of NOMA is much larger than that of OMA. They later [10] proposed a cluster-based cognitive industrial IoT (CIIoT), in which data were transmitted through NOMA. The results showed that NOMA for the cluster-based CIIoT could guarantee transmission performance and improve system throughput.
When the NOMA system was first proposed, its resource allocation problem was mainly studied by constructing a joint optimization of user grouping and power allocation and finding the optimal solution with classical algorithms such as convex optimization and Lagrange multipliers. Han et al. [5] used a Lagrangian dual decomposition method to solve the non-convex optimization problem of power allocation, and the results showed that the optimized algorithm can significantly improve system performance. Islam et al. [11] proposed a random user pairing method, in which the base station randomly selects users to form several user sets of equal size and then groups the two users with a large channel gain difference within each set. Benjebbovu et al. [12] proposed an exhaustive user grouping algorithm. Zhang et al. [13] proposed a user grouping algorithm based on channel gain. These algorithms can improve system performance, but their complexities are too high for practical application. Salaün et al. [14] proposed a joint subchannel and power allocation algorithm, and the results showed that this algorithm has low complexity. However, due to the dynamism and uncertainty of wireless communication systems, it is difficult for these joint optimization algorithms of user grouping and power allocation to model the system and derive the optimal scheme. Without an accurate system model, the performance of the NOMA system may be limited.
In recent years, deep learning has been applied to wireless communication, and many scholars use neural networks to approximate optimization problems. Gui et al. [15, 16] used neural networks to allocate resources and proposed a deep-learning-aided NOMA system, which performed well compared with traditional methods. Saetan and Thipchaksurat [17] proposed a power allocation scheme that maximizes the system sum rate: the optimal scheme was found by exhaustive search and then learned by training a deep neural network. The results showed that the scheme could approach the optimal sum rate while reducing computational complexity. Huang et al. [18] designed an effective deep neural network that implemented user grouping and power allocation through a training algorithm, improving transmission rate and energy efficiency. However, deep learning itself cannot generate the learning targets; they must be produced by optimization algorithms. Deep neural networks are trained on the learning targets provided by an optimization algorithm, which increases computational speed and reduces running time. Therefore, deep-learning-based approaches require traditional optimization algorithms to generate optimal labels for training. In a complex system, good training data are difficult to obtain, and training is very time-consuming.
To solve these problems, deep reinforcement learning (DRL) is applied. DRL combines deep learning and reinforcement learning: it uses the powerful representation ability of neural networks to fit the Q table or the policy directly, thereby handling large or continuous state–action spaces [19]. Ahsan et al. [20] proposed an optimization algorithm based on DRL and SARSA to maximize the sum rate; the results showed that it could achieve high accuracy with low complexity. Mnih et al. [21] proposed the Deep Q-Network (DQN), which has been used as an approximator in many fields. He et al. [22] proposed a DRL-based resource allocation scheme that expressed the joint channel allocation and user grouping problem as an optimization problem. Compared with other methods, the proposed framework achieved better system performance.
When using deep reinforcement learning to allocate NOMA resources, several problems still need to be solved. Firstly, experience replay is used in DQN (the most commonly used DRL network) to reduce the correlation between samples and ensure that samples are independent and identically distributed, but the current sampling method of the sample pool is uniform sampling, which ignores the importance of each sample. In the sampling process, some valuable samples may never be learned, which reduces the learning rate. The prioritized DQN algorithm [23] can solve this sampling problem in experience replay: it improves sampling efficiency and learning rate by using a sum tree and importance sampling. In addition, the output of DQN can only be discrete, while user power is continuous; the power can be quantized, but quantization introduces quantization error. The deep deterministic policy gradient (DDPG) network [24] can solve this problem and uses an actor-critic structure to improve learning stability. Meng et al. [25] performed multi-user power allocation based on the DDPG algorithm. The results showed that the algorithm is superior to existing models in terms of sum rate and has better generalization ability and faster processing speed.
Aiming at the above problems in current NOMA resource allocation methods, this paper proposes a joint optimization method of user grouping and power allocation in the NOMA system based on deep reinforcement learning. Firstly, this paper proposes a joint DQN-DDPG design, in which the DQN executes the discrete task of user grouping while the DDPG network executes the continuous task of allocating power to each user. Secondly, this paper proposes a solution to the problems of random sampling, where the temporal-difference error (TD-error) is used to calculate the sample priority and valuable samples are sampled according to this priority. Besides, a sum tree is utilized to speed up the search for priority samples.
The paper is organized as follows. Section 2 presents the system model of NOMA, formulates the optimization objective of this paper, and describes the proposed DRL-based NOMA resource allocation algorithm. Section 3 shows the numerical simulation results. Section 4 draws a conclusion.
2 Methods
2.1 System model
This paper studies the resource allocation issue of an uplink multi-user NOMA system, where the base station (BS) is located in the center of the cell and the users are randomly distributed near the base station. The total system bandwidth B is equally divided among S subchannels, and the users in the same subchannel are non-orthogonal. Assume there are U users and S subchannels in the system, and the maximum power transmitted by the base station is \(P_{\max }\). The signal transmitted on subchannel s is,

$$x_{s} (t) = \sum\limits_{u = 1}^{U} {b_{s,u} (t)\sqrt {p_{s,u} (t)} \,x_{s,u} (t)} ,$$
where \(x_{s,u} (t)\) and \(p_{s,u} (t)\) represent the data signal and allocated power of user u on subchannel s, respectively. \(b_{s,u} (t) = 1\) indicates that subchannel s is allocated to user u, and vice versa. The received signal can be expressed as,

$$y_{s} (t) = \sum\limits_{u = 1}^{U} {b_{s,u} (t)h_{s,u} (t)\sqrt {p_{s,u} (t)} \,x_{s,u} (t)} + z_{s,u} (t),$$
where \(h_{s,u} (t) = g_{s,u} PL^{ - 1} (d_{s,u} )\) denotes the channel gain between the base station and user u on subchannel s. Assume that \(g_{s,u}\) is the Rayleigh fading channel gain [26], \(PL^{ - 1} (d_{s,u} )\) is the path loss, and \(d_{s,u}\) is the distance between user u and the base station on channel s. \(z_{s,u} (t)\) represents additive white Gaussian noise following the complex Gaussian distribution, i.e., \(z_{s,u} (t)\sim {\text{CN}}(0,\sigma_{n}^{2} )\).
In the NOMA system, due to the interference introduced by superimposed users, the successive interference cancellation (SIC) technique is required at the receiver. The receiver first decodes the user with the highest received power, subtracts it from the mixed signal, and repeats this process until the desired signal has the maximum power among the remaining superimposed signals, treating the rest as interference. As a result, the signal-to-interference-plus-noise ratio (SINR) can be described as,

$$\gamma_{s,u} (t) = \frac{{p_{s,u} (t)\left| {h_{s,u} (t)} \right|^{2} }}{{\sum\nolimits_{j:\,p_{s,j} (t)\left| {h_{s,j} (t)} \right|^{2} < p_{s,u} (t)\left| {h_{s,u} (t)} \right|^{2} }} {p_{s,j} (t)\left| {h_{s,j} (t)} \right|^{2} } + \sigma_{n}^{2} }.$$
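The SIC decoding order described above can be sketched in code. The following is a minimal illustration (not the paper's implementation): users on one subchannel are decoded in descending order of received power, each decoded signal is subtracted, and the signals not yet decoded are treated as interference.

```python
import numpy as np

def sinr_with_sic(p, h, noise_power):
    """SINR per user on one subchannel under SIC decoding.

    p: allocated powers; h: channel gains (one entry per superimposed
    user); noise_power: sigma_n^2.  Users are decoded in descending
    order of received power p*|h|^2; each decoded signal is subtracted,
    and the remaining signals are treated as interference.
    """
    rx = np.asarray(p) * np.abs(h) ** 2   # received power per user
    order = np.argsort(-rx)               # decode strongest first
    sinr = np.empty_like(rx, dtype=float)
    remaining = rx.sum()
    for u in order:
        remaining -= rx[u]                # subtract the decoded signal
        sinr[u] = rx[u] / (remaining + noise_power)
    return sinr
```

For two users with received powers 4 and 1 and unit noise, the stronger user sees the weaker one as interference (SINR 4/2 = 2), while the weaker user, decoded last, sees only noise (SINR 1/1 = 1).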
The data rate of user u on subchannel s is defined as,

$$R_{s,u} (t) = \frac{B}{S}\log_{2} \left( {1 + \gamma_{s,u} (t)} \right).$$
The user sum rate is,

$$R = \sum\limits_{s = 1}^{S} {\sum\limits_{u = 1}^{U} {b_{s,u} (t)R_{s,u} (t)} }.$$
The optimization objectives and constraints of the joint user grouping and power allocation problem are given as follows,
In the above constraints, C1 indicates that the power allocated to each user should be less than the maximum transmission power of the base station. C3 and C4 indicate that multiple users can be placed on one subchannel. Because the objective function is a non-convex optimization problem, it is difficult to find the global optimal solution. Although the global search method can find the optimal solution by searching all the grouping possibilities, its computational complexity is too high for practice. Therefore, a DRL-based method is proposed for user grouping and power allocation in the NOMA system.
2.2 NOMA resource allocation based on DRL network
In this section, a NOMA resource allocation network based on DRL network is proposed. The description of the system structure is given in the following subsections.
2.2.1 System structure
The system structure is shown in Fig. 1. Figure 1a is a general reinforcement learning structure, which is mainly divided into five parts: agent, environment, state s_{t}, action a_{t}, and immediate reward r_{t}. The learning process can be described as follows: the agent obtains the state s_{t} from the environment, then selects an action a_{t} from the action space and feeds it back to the environment. The environment generates a reward r_{t}, produced by choosing action a_{t} in the current state s_{t}, as well as the state s_{t+1} of the next time slot, and gives them back to the agent. The agent stores this learning experience in the experience replay pool to facilitate learning in the next time slot.
According to the structure of reinforcement learning, the system model designed in this paper is shown in Fig. 1b. Specifically, the NOMA system represents the environment of reinforcement learning. There are two agents: the Prioritized DQN user grouping network is agent 1, and the DDPG power allocation network is agent 2. We use the channel gain as the characterization of the environment. Accordingly, the state space can be expressed as \(S = \{ h_{1,1} (t),\;h_{2,1} (t), \ldots ,h_{s,u} (t)\}\), the user grouping space as \(A1 = \{ b_{1,1} (t),b_{2,1} (t), \ldots ,b_{s,u} (t)\}\), and the power allocation space as \(A2 = \{ p_{1,1} (t),\;p_{2,1} (t), \ldots ,p_{s,u} (t)\}\). Besides, the immediate reward is denoted as \(r_{t} = R\), where R is the system sum rate defined in (5). Our goal is to maximize the long-term reward, which is expressed as,

$$R_{t} = \sum\limits_{k = 0}^{\infty } {\gamma^{k} r_{t + k} } ,$$
where γ is the discount factor. When \(\gamma = 0\), the agent only pays attention to the reward generated in the current state; when \(\gamma \ne 0\), the agent also pays attention to future rewards, and future rewards take more weight as γ increases.
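The long-term reward above is the standard discounted return, which can be accumulated backward over a finite reward sequence. A small sketch (illustrative only):

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward R_t = sum_k gamma^k * r_{t+k},
    computed backward over a finite reward sequence."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```

For example, three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75, and gamma = 0 keeps only the immediate reward, matching the discussion of γ above.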
The expected value of the cumulative return R_{t} (obtained by (7)) is defined as the Q value, which is determined by the state s_{t} and the selection of action a_{t} under a certain strategy \(\pi\). It is expressed as,

$$Q_{\pi } (s_{t} ,a_{t} ) = E_{\pi } \left[ {R_{t} |s_{t} ,a_{t} } \right].$$
In summary, in each time slot (TS), the agent obtains the channel gain from the NOMA system, selects a user combination and power from the action spaces according to the current channel gain, and gives the action (optimal user groups and power) back to the NOMA system. According to the received action, the NOMA system generates the immediate reward and the channel gain of the next time slot, and then passes them to the agent. Based on the reward, the agent updates the decision function for selecting this action under the current channel gain, which completes one interaction. This process is repeated until the agent can generate an optimal decision under any channel gain. The specific design of the Prioritized DQN and DDPG networks in Fig. 1b is illustrated in Fig. 2, and the detailed descriptions are given in the following subsections.
2.2.2 User grouping based on prioritized DQN
In this article, we use the Prioritized DQN, an improved version of DQN, to perform user grouping. The DQN includes two networks: the Q network generates the estimated Q value, and the target Q network generates the target Q value used to train the Q network parameters. The two networks are identical in structure but differ in parameters: the Q network is trained with the latest parameters, while the parameters of the target Q network are copied from the Q network at intervals. The main idea of the DQN algorithm is to continuously adjust the network weights by optimizing the loss function between the estimated and target Q values. Moreover, experience replay is used in the DQN to reduce the correlation between samples. In DQN, all samples are uniformly drawn from the experience replay pool, so some important samples may be neglected, which reduces learning efficiency. To make up for this shortcoming of random sampling, a reinforcement learning method based on prioritized experience replay was proposed, which mainly solves the sampling problem in experience replay [27]. The main idea is to set priorities for different samples to increase the sampling probability of valuable ones. To better understand the algorithm, we first introduce prioritized experience replay.

1.
Prioritized experience replay
Temporal-difference error (TD-error) indicates the difference between the target action value and the estimated value. TD-errors produced by different samples are different, and their effects on backpropagation are also different. A sample with a large TD-error indicates a big gap between the current value and the target value, which means that the sample still needs to be learned and trained. Therefore, in order to measure the importance of a sample, we use the TD-error to represent its priority, which can be expressed as,

$$p_{i} = \left| {\delta_{i} } \right| + \psi ,$$

where δ_{i} is the TD-error of sample i, \(\psi\) is a very small constant ensuring that samples with a priority of 0 can still be selected, and y_{i} is the target value defined in (14).
By setting priorities for samples, samples with large TD-errors are sampled with high probability and join the learning process more frequently. In contrast, samples with small TD-errors may never be replayed, because their TD-errors are not updated every time and remain small. In this case, the diversity of samples is lost, which results in overfitting. It is therefore necessary to ensure that samples with low priority can still be sampled with a certain probability, so the sampling probability is defined as [23],

$$P(i) = \frac{{p_{i}^{\alpha } }}{{\sum\nolimits_{k} {p_{k}^{\alpha } } }},$$

where α determines the degree of prioritization. The range of α is [0,1]: α = 0 means uniform sampling, and α = 1 means greedy sampling. It does not change the monotonicity of the priorities, but scales how strongly the TD-error influences the sampling probability.

A.
Sum tree
Prioritized DQN uses a sum tree to avoid sorting the samples before sampling. The sum tree is a binary tree, whose structure is shown in Fig. 3a. The top of the sum tree is the root node, each tree node has only two child nodes, the bottom layer consists of leaf nodes, and the rest are internal nodes. The numbers in the figure are the indexes of the nodes, starting from the root node 0. We also use an array T to store the corresponding sample tuples, as shown in Table 1, where idx represents the index of the sample tuple in array T.

B.
Storing data
Assuming that the layer index is denoted as l and the number of layers of the binary tree is L, the number of nodes at layer l is \(2^{l - 1} \;(l = 1,2,\ldots,L)\), and the total number of nodes in the binary tree is 2^{L}−1. It can be found in Fig. 3a that the number of the leftmost node of layer l can be expressed as \(2^{l - 1} - 1\;(l = 1,2,\ldots,L)\). The index of array T corresponding to the leftmost leaf node is 0, denoted as idx = 0. When storing a priority, the number of the leaf node and idx are both increased by 1. The sum tree only stores sample priorities at the leaf nodes, i.e., the nodes at the L-th layer, and the priority of each leaf node is matched to a sample tuple in array T. In addition, the priority of an internal node is the sum of the priorities of its child nodes, so the priority of the root node is the sum of the priorities of all leaf nodes. The higher the value of a leaf node, the higher the priority of the sample. The priorities of the samples are stored in the leaf nodes from left to right. The storage steps are as follows:

1.
Number the 2^{L}−1 nodes of the sum tree, and initialize the priorities of all leaf nodes to 0;

2.
The priority of the current sample is stored in the leftmost leaf node, and the current sample tuple is stored in array T at index idx = 0. At the same time, the priorities of the parent nodes of the whole binary tree are updated upward;

3.
Add the priority of the next sample at the second leaf node of the sum tree. The number of this leaf node is 2^{L−1} (obtained by (2^{L−1}−1) + 1), and the index of array T corresponding to this leaf node is 1 (obtained by 0 + 1). Then, add the sample tuple to array T at index 1, and update the priorities of the parent nodes of the whole binary tree upward;

4.
According to the storage method above, the priorities of the samples are added to the leaf nodes one by one. When all the leaf nodes are filled, the subsequent priority will be stored in the first leaf node again.
The difference between the leaf node number of the sum tree and the index of the corresponding entry in T is 2^{L−1}−1. The binary tree after storing the priorities is shown in Fig. 3b. The leaf node numbered 7 has the highest priority, which indicates that it has the largest probability of being sampled.

C.
Sampling data
Denote the number of samples to be extracted as N, and the priority of the root node as P. Dividing P by N gives the quotient M; hence the total priority is divided into N intervals, and the j-th interval is [(j−1)·M, j·M]. For example, if the priority of the root node is 1.12 and the number of samples is 8, the priority intervals are [0,0.14], [0.14,0.28], [0.28,0.42], [0.42,0.56], [0.56,0.70], [0.70,0.84], [0.84,0.98], and [0.98,1.12]. A number is sampled uniformly in each interval; suppose that 0.60 is drawn from the interval [0.56,0.70]. Start traversing from the root node and compare 0.60 with the left child node 0.69. Since the left child node 0.69 is larger than 0.60, take the path of the left child node and traverse its children. Then compare 0.60 with the left child node 0.50 of node 0.69. Since 0.60 is larger than 0.50, subtract 0.50 from 0.60 and enter the right child node 0.19, then traverse its children. Compare the remaining 0.10 with the left child node 0.01 of node 0.19. Since 0.10 is larger than 0.01, take the path of the right child node. Finally, a leaf with priority 0.18 is reached, whose sum tree node number is 10, and the sample corresponding to this leaf node is extracted from array T. After that, a number is uniformly drawn from each remaining interval and the traversal above is repeated, until all 8 samples are obtained.
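The storing and sampling procedures above can be summarized in a compact sum-tree sketch. This is an illustrative implementation, not the paper's code: `add` fills the leaves from left to right and wraps around when full, `update` propagates a priority change up to the root, `get` performs the root-to-leaf traversal of the walkthrough, and `sample` draws one value per priority interval.

```python
import random

class SumTree:
    """Binary sum tree: leaves hold sample priorities, internal nodes
    hold the sum of their children, and the root holds the total."""

    def __init__(self, capacity):
        self.capacity = capacity                   # number of leaves
        self.tree = [0.0] * (2 * capacity - 1)
        self.data = [None] * capacity              # the array T of tuples
        self.write = 0                             # next idx to fill

    def add(self, priority, sample):
        leaf = self.write + self.capacity - 1      # leaf number offset
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity  # wrap to first leaf

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf > 0:                            # propagate upward
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Descend from the root to the leaf whose interval covers value."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]           # go right, subtract left sum
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    def sample(self, n):
        """Stratified sampling: split the total priority into n intervals."""
        segment = self.tree[0] / n
        return [self.get(random.uniform(j * segment, (j + 1) * segment))
                for j in range(n)]
```

With leaf priorities 0.1, 0.2, 0.3, 0.4, the root holds 1.0, a drawn value of 0.05 reaches the first leaf, and 0.95 reaches the last, mirroring the traversal example above.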

D.
Importance sampling
The distribution of the samples used to train the network should match the original sample distribution. However, since we tend to replay experience samples with high TD-errors more frequently, the sample distribution is changed. This change biases the estimated value, because high-priority experience samples are used to train the network more often. Importance sampling adjusts the updates of the network model by reducing the weight of such samples, so that the introduced bias can be corrected [28]. The weight of importance sampling is,

$$w_{i} = \left( {\frac{1}{N \cdot P(i)}} \right)^{\beta } ,$$
where N is the number of samples, P(i) is the sampling probability calculated according to (10), and \(\beta\) adjusts the degree of correction. The slight bias can be ignored at the beginning of learning, and the correcting effect of importance sampling should grow over time, so \(\beta\) increases linearly from its initial value and converges to 1 at the end of training. When \(\beta = 1\), the bias is completely eliminated.
Figure 4 shows the relationship between \(\beta\) and the number of iterations (the initial value is 0.4). It can be seen that at the end of the iterations \(\beta\) converges to 1, which means that the non-uniform sampling probability is fully compensated and the bias caused by prioritized experience replay is corrected.
In order to ensure the stability of learning, we always normalize the weights, so (11) can be rewritten as,

$$w_{i} = \frac{{\left( {N \cdot P(i)} \right)^{ - \beta } }}{{\max_{j} w_{j} }}.$$
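The normalized importance-sampling weights can be computed directly from the sampling probabilities. A small sketch (the function name and vectorized form are our own):

```python
import numpy as np

def importance_weights(probs, n, beta):
    """Normalized importance-sampling weights:
    w_i = (1 / (N * P(i)))**beta, divided by max_j w_j so the
    largest weight is 1, which stabilizes the gradient scale."""
    w = (1.0 / (n * np.asarray(probs))) ** beta
    return w / w.max()
```

Under uniform sampling probabilities all weights equal 1, so the correction vanishes exactly when prioritization introduces no bias.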

2.
Prioritized-DQN-based user grouping network
In this section, we introduce the user grouping framework based on Prioritized DQN. As shown in Fig. 2, the user grouping part contains prioritized experience replay. Prioritized DQN contains two sub-networks: a Q network used to generate the estimated Q value of the selected action, and a target Q network used to generate the target Q value for training the neural network.
In our NOMA system, at the beginning of each TS t, the base station receives the channel state information \(s_{t}\) and inputs it into the Q network of the Prioritized DQN. With \(s_{t}\) as input, the Q network outputs the estimated Q value \(Q(s_{t} ,a_{t}^{1} ;\omega )\) of every user combination \(a_{t}^{1}\) \((a_{t}^{1} \in A1)\). In this paper, the \(\zeta\)-greedy strategy is used to select the user combination \(a_{t}^{1}\): a user combination is selected randomly from A1 with probability \(\zeta\), or the user combination with the highest estimated Q value is selected with probability \((1 - \zeta )\). That is,

$$a_{t}^{1} = \left\{ {\begin{array}{*{20}l} {{\text{random}}\;a \in A1,} \hfill & {{\text{with probability}}\;\zeta } \hfill \\ {\arg \max_{a \in A1} Q(s_{t} ,a;\omega ),} \hfill & {{\text{with probability}}\;1 - \zeta .} \hfill \\ \end{array} } \right.$$
Finally, the user combination \(a_{t}^{1}\) and power \(a_{t}^{2}\) (produced by the next section) are given back to the NOMA system. According to the selected actions, the NOMA system generates instant rewards \(r_{t}\) and channel state information \(s_{t + 1}\) of the next time slot. We store the sample tuple \((s_{t} ,a_{t}^{1} ,r_{t} ,s_{t + 1} )\) of each TS into the memory block.
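The ζ-greedy selection step can be sketched as follows. This is an illustration only, following the convention in the text (explore with probability ζ); `q_values` is assumed to hold the Q network's output for every combination in A1:

```python
import random

def select_user_combination(q_values, zeta):
    """zeta-greedy choice over the user-grouping action space A1.

    With probability zeta, explore a random combination; otherwise
    exploit the combination with the highest estimated Q value.
    """
    if random.random() < zeta:
        return random.randrange(len(q_values))          # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

The returned index identifies the chosen combination \(a_{t}^{1}\), which is then fed back to the NOMA system together with the power action.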
In each TS, in order to ensure that all samples can be sampled, Prioritized DQN assigns new samples the highest priority and stores the sample tuples and priorities in the experience pool following the storage steps in subsection 1A of Sect. 2.2.2. Sample tuples are then selected according to the sampling method in subsection 1B of Sect. 2.2.2. As mentioned above, we use the sampling probability to calculate the sample weight (i.e., (12)), and use the target Q network to generate the target Q value for training the network, which is,

$$y_{i} = r_{i} + \gamma \mathop {\max }\limits_{{a^{\prime}}} Q^{\prime}(s_{i + 1} ,a^{\prime};\omega^{\prime}),$$
The loss function of Prioritized DQN can then be expressed as,

$$L(\omega ) = \frac{1}{N}\sum\limits_{i = 1}^{N} {w_{i} \left( {y_{i} - Q(s_{i} ,a_{i}^{1} ;\omega )} \right)^{2} }.$$
Update all the weights \(\omega\) of the Q network in the Prioritized DQN through gradient backpropagation, and update all the parameters of the target Q network by copying the parameters of their corresponding network in every W TS, i.e. \(\omega ^{\prime} = \omega\).
After the parameters of the Q network of the Prioritized DQN are updated, it is necessary to recalculate the TD-errors (i.e., (9)) of all the selected samples. Find the corresponding leaf node according to the leaf node number obtained during sampling, and set the new TD-error as the priority of that sample. Then follow the same data storage method to update the priority of the sum tree leaf node and of all its parent nodes.
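The priority update step after a gradient step can be sketched as below (illustrative; `tree` is assumed to expose an `update(leaf, priority)` method, as a sum-tree implementation typically does, and `psi` is the small constant from the priority definition):

```python
def update_priorities(tree, leaves, td_errors, psi=1e-6):
    """Recompute each sampled transition's priority p_i = |delta_i| + psi
    and write it back to its leaf, which propagates the change upward
    through the sum tree via tree.update."""
    for leaf, delta in zip(leaves, td_errors):
        tree.update(leaf, abs(delta) + psi)
```

Because the leaf numbers were recorded at sampling time, no search is needed to locate the samples whose priorities changed.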
2.2.3 Power allocation based on DDPG network
Since the output of DQN is discrete, it cannot be applied to a continuous action space. Fortunately, an actor-critic-based DDPG network can handle continuous actions. Wang et al. [29] proposed two frameworks (i.e., DDRA and CDRA) to maximize the energy efficiency of the NOMA system, where DDRA is based on the DDPG network and CDRA is based on multi-DQN. The results show that the time complexities of the two frameworks are similar, but the DDPG network performs better than the multi-DQN network, because in multi-DQN the user power is quantized, which loses some important information and degrades performance. The DDPG network is similar to DQN in that it uses deep neural networks and uniform sampling; it is also a deterministic policy gradient network, in which the action is uniquely determined in a given state. Moreover, DDPG can handle continuous action tasks without quantizing the transmission power. Hence this paper uses the DDPG network to perform the user's power allocation task.
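One practical detail of a DDPG actor for this task is mapping the network's unbounded output to a valid continuous transmit power. The following is our own hedged sketch, not the paper's network: it squashes a raw actor output into [p_min, p_max] with a sigmoid, which avoids the quantization step a DQN would require.

```python
import numpy as np

def actor_power_output(actor_raw, p_min, p_max):
    """Map the actor network's raw output (any real number per user)
    to a continuous transmit power in [p_min, p_max] via a sigmoid."""
    squashed = 1.0 / (1.0 + np.exp(-np.asarray(actor_raw)))
    return p_min + (p_max - p_min) * squashed
```

A raw output of 0 lands at the midpoint of the power range, and large positive or negative outputs saturate smoothly at the bounds, so the critic always sees feasible power actions.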
3 Results and discussion
This section shows the simulation results of the above-mentioned DRL-based NOMA user grouping and power allocation algorithms. Assume that 4 users transmit signals on 2 channels; the 4 users are randomly distributed in a cell with a radius of 500 m, and the minimum distance between a user and the base station is 30 m. The path loss equation is PL^{−1}(d_{s,u}) = 38 + 15 lg(d_{s,u}). The total system bandwidth is 10 MHz, and the noise power density is −110 dBm/Hz. The maximum transmission power of the base station is 40 dBm, and the minimum power is 3 dBm.
In the Prioritized DQN, the number of leaf nodes is 500 and the number of samples N is 32; by the end of the training process, the sample size is 6400. The reward discount factor γ is 0.9, and the greedy selection strategy probability \(\zeta\) is 0.9. The initial bias degree is set to \(\beta = 0.4\) and the priority degree to \(\alpha = 0.6\) in the prioritized experience replay.
3.1 Convergence of the proposed algorithm
Figure 5 shows the convergence performance of the proposed algorithm. Figure 5a shows the convergence of the sum rate under different maximum transmission powers of the base station, set to 30 dBm, 37 dBm, and 40 dBm, respectively. It can be observed that under each transmission power, the sum rate of the system gradually increases and then converges, which demonstrates the good convergence of the proposed algorithm.
Figure 5b compares against the common DQN user grouping algorithm and analyzes the convergence of the proposed Prioritized DQN; in both cases power is allocated to users by the DDPG network. It is clear that the algorithm with prioritized sampling reduces training time and makes the learning process more stable: prioritized experience replay completes the user grouping task in about 100 episodes, while uniform experience replay needs around 300 episodes for the same task. This is because prioritized experience replay stores the learning experience with priorities in the experience pool and traverses the sum tree to extract samples with high TD-errors to guide the optimization of model parameters, which alleviates the problems of sparse rewards and insufficient sampling and improves learning efficiency. Also, prioritized experience replay not only focuses on samples with high TD-error to speed up training, but also involves samples with lower TD-error to increase training diversity.
3.2 Average sum rate performance of the proposed algorithm
The NOMA resource allocation algorithm based on Prioritized DQN and DDPG proposed in this paper is denoted as Prio-DQN-DDPG. In order to verify its effectiveness, this paper compares several resource allocation algorithms: ES-EPA, ES-MP, multi-DQN, DQN-DDPG, and IPOP. Specifically, ES-EPA uses exhaustive search to select the best user combination together with equal power allocation. ES-MP also uses exhaustive search to select the best user grouping, with the maximum power transmission algorithm determining the power for each user. The multi-DQN algorithm [29] uses multiple DQNs with quantized power, and the DQN-DDPG algorithm selects the user grouping with a DQN and allocates power to each user with DDPG [30]. IPOP is an iterative power optimization algorithm [31], which finds the optimal solution by constructing a Lagrangian dual function. Figure 6 shows the experimental results of the user sum rate. All results are averaged every 200 TSs to achieve a smoother and clearer comparison.
As can be seen from Fig. 6, the sum rate of the ES-MP algorithm is lower than those of the other algorithms. This is because every user transmits at the maximum allowed power, which causes strong interference among users. The performances of the multi-DQN, DQN-DDPG and Prio-DQN-DDPG algorithms improve as the number of episodes increases, and the proposed Prio-DQN-DDPG algorithm outperforms the other two DRL-based algorithms. Compared with the DQN-DDPG algorithm, the proposed algorithm improves the system sum rate by 2%. This is mainly because Prioritized DQN assigns higher priority to the valuable samples that benefit network training; moreover, it stores the priorities in a sum tree, which makes it efficient to search for experience samples with high priority, so valuable experience is replayed more frequently, improving both the learning rate and the system sum rate. Compared with the multi-DQN algorithm, the proposed method uses a DDPG network for user power allocation; DDPG can handle continuous-action tasks and avoids the quantization error caused by discretizing the power levels.
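To make concrete why the power split within a group drives the sum rate, the achievable rate of a single two-user downlink NOMA group with perfect SIC can be computed as below. This is a simplified sketch under our own assumptions (function name, unit bandwidth, and noise normalization are ours; the paper's exact system model may differ): the strong user cancels the weak user's signal via SIC, while the weak user treats the strong user's signal as interference.

```python
import math

def noma_pair_sum_rate(g_strong, g_weak, p_strong, p_weak, noise=1.0):
    """Sum rate (bit/s/Hz) of one downlink NOMA pair with perfect SIC.

    g_strong, g_weak: channel gains |h|^2 of the strong and weak user.
    p_strong, p_weak: transmit powers allocated to each user.
    """
    # Strong user: decodes and removes the weak user's signal first (SIC),
    # so it sees only its own signal plus noise.
    r_strong = math.log2(1 + p_strong * g_strong / noise)
    # Weak user: decodes directly, with the strong user's signal as interference.
    r_weak = math.log2(1 + p_weak * g_weak / (p_strong * g_weak + noise))
    return r_strong + r_weak
```

Sweeping `p_strong`/`p_weak` over a fixed power budget with such a function reproduces the trade-off the DDPG network learns to navigate: the weak user's rate saturates as interference grows, so the split that maximizes the sum is channel-dependent.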
Furthermore, the Prio-DQN-DDPG framework proposed in this paper interacts with the NOMA system and, guided by system feedback, dynamically finds the optimal resource allocation strategy as the environment changes, reaching 93% of the sum rate of IPOP and 94% of that of ES-EPA. Although the sum rate of the proposed Prio-DQN-DDPG network is lower than those of IPOP and ES-EPA, it greatly reduces the computational complexity, as discussed in the following section.
3.3 Computational complexity analysis
This section analyzes the computational complexity of the proposed algorithm; Fig. 7 shows the result. The computation time of the proposed algorithm is 10% higher than that of the traditional DQN-DDPG algorithm. This is because prioritized experience replay consists of setting sample priorities, storing experience samples and extracting samples, and it needs extra time to calculate the TD-error and traverse the sum tree. However, prioritized experience replay replays valuable samples frequently, which avoids unnecessary DRL updates and reduces training time. Compared with the optimal ES-EPA and IPOP algorithms, the computational complexity is reduced by 43% and 64%, respectively.
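The cost of the exhaustive-search baselines comes from the size of the grouping space: partitioning 2K users into K two-user groups admits (2K)!/(K! · 2^K) distinct groupings, which grows super-exponentially in K. A small sketch (the function name is ours) makes this concrete:

```python
from math import factorial

def exhaustive_pairings(num_users):
    """Number of ways to split an even number of users into unordered pairs:
    (2K)! / (K! * 2^K) for num_users = 2K. ES-EPA and ES-MP must evaluate
    every such grouping, which is why they become impractical as the
    number of users grows, while a trained DRL policy needs only one
    forward pass per decision."""
    assert num_users % 2 == 0, "pairing requires an even number of users"
    k = num_users // 2
    return factorial(num_users) // (factorial(k) * 2**k)
```

For example, 8 users already admit 105 groupings and 20 users over 650 million, so the fixed per-decision cost of the learned policy quickly dominates the one-off training overhead.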
4 Conclusion and future work
This paper proposes a joint user grouping and power allocation algorithm to solve the resource allocation problem in the multi-user NOMA system. While guaranteeing the minimum data rate of all users, we use a DRL-based framework to maximize the sum rate of the NOMA system. In particular, with the current channel state information as input and the sum rate as the optimization goal, we design a Prioritized-DQN-based network to output the optimal user grouping strategy, and then use a DDPG network to output the power of every user. The proposed algorithm replaces the previous uniform experience replay with prioritized experience replay, which evaluates the importance of samples by their TD-error and stores experience in a binary-tree-based priority queue. This sampling method allows the samples that are more useful for the learning process to be replayed more frequently, and the simulation results show that it increases the learning rate. In the power allocation part, there is no need to quantize the transmission power; the powers of all users are output directly from the current state information. In addition, the joint algorithm proposed in this paper improves the system sum rate by 2% compared with the ordinary DQN algorithm, and reaches 94% and 93% of the optimal exhaustive search algorithm and the iterative power optimization algorithm, respectively.
In addition, given the promising development prospects of NOMA over complex channels, resource allocation for cell-free massive MIMO-NOMA networks will be the focus of our future work.
Availability of data and materials
Please contact author for data requests.
Abbreviations
IoV: Internet of Vehicles
NOMA: Non-orthogonal multiple access
SIC: Successive interference cancellation
BS: Base station
DRL: Deep reinforcement learning
SC: Superposition coding
TS: Time slot
DQN: Deep Q network
DDPG: Deep deterministic policy gradient network
TD-error: Temporal-difference error
IPOP: Iterative power optimization
References
W.U. Khan, M.A. Javed, T.N. Nguyen et al., Energy-efficient resource allocation for 6G backscatter-enabled NOMA IoV networks. IEEE Trans. Intell. Transp. Syst. (2021). https://doi.org/10.1109/TITS.2021.3110942
X. Liu, M. Jia, X. Zhang et al., A novel multichannel Internet of things based on dynamic spectrum sharing in 5G communication. IEEE Internet Things J. 6(4), 5962–5970 (2018)
X. Liu, X. Zhang, Rate and energy efficiency improvements for 5G-based IoT with simultaneous transfer. IEEE Internet Things J. 6(4), 5971–5980 (2018)
X. Liu, X. Zhang, M. Jia et al., 5G-based green broadband communication system design with simultaneous wireless information and power transfer. Phys. Commun. 28, 130–137 (2018)
S. Han, Y. Huang, W. Meng et al., Optimal power allocation for SCMA downlink systems based on maximum capacity. IEEE Trans. Commun. 67(2), 1480–1489 (2018)
K. Yang, N. Yang, N. Ye et al., Non-orthogonal multiple access: achieving sustainable future radio access. IEEE Commun. Mag. 57(2), 116–121 (2018)
S.R. Islam, N. Avazov, O.A. Dobre et al., Power-domain non-orthogonal multiple access (NOMA) in 5G systems: potentials and challenges. IEEE Commun. Surv. Tutor. 19(2), 721–742 (2016)
Q. Le, V.D. Nguyen, O.A. Dobre et al., Learning-assisted user clustering in cell-free massive MIMO-NOMA networks. IEEE Trans. Veh. Technol. (2021). https://doi.org/10.1109/TVT.2021.3121217
X. Liu, X.B. Zhai, W. Lu et al., QoS-guarantee resource allocation for multibeam satellite industrial Internet of Things with NOMA. IEEE Trans. Ind. Inform. 17(3), 2052–2061 (2019)
X. Liu, X. Zhang, NOMA-based resource allocation for cluster-based cognitive industrial internet of things. IEEE Trans. Ind. Inform. 16(8), 5379–5388 (2019)
S.R. Islam, M. Zeng, O.A. Dobre et al., Resource allocation for downlink NOMA systems: key techniques and open issues. IEEE Wirel. Commun. 25(2), 40–47 (2018)
A. Benjebbovu, A. Li, Y. Saito et al., System-level performance of downlink NOMA for future LTE enhancements, in IEEE Globecom Workshops (2013), p. 66–70. https://doi.org/10.1109/GLOCOMW.2013.6824963
H. Zhang, D.K. Zhang, W.X. Meng et al., User pairing algorithm with SIC in non-orthogonal multiple access system, in Proceedings of International Conference on Communications (2016), p. 1–6
L. Salaün, M. Coupechoux, C.S. Chen, Joint subcarrier and power allocation in NOMA: optimal and approximate algorithms. IEEE Trans. Signal Process. 68, 2215–2230 (2020)
G. Gui, H. Huang, Y. Song et al., Deep learning for an effective non-orthogonal multiple access scheme. IEEE Trans. Veh. Technol. 67(9), 8440–8450 (2018)
M. Liu, T. Song, L. Zhang et al., Resource allocation for NOMA based heterogeneous IoT with imperfect SIC: a deep learning method, in Proceedings of IEEE Annual International Symposium on Personal, Indoor and Mobile Radio Communications (2018), p. 1440–1446
W. Saetan, S. Thipchaksurat, Power allocation for sum rate maximization in 5G NOMA system with imperfect SIC: a deep learning approach, in Proceedings of the 4th International Conference on Information Technology (2019), p. 195–198
H. Huang, Y. Yang, Z. Ding et al., Deep learning-based sum data rate and energy efficiency optimization for MIMO-NOMA systems. IEEE Trans. Wirel. Commun. 19(8), 5373–5388 (2020)
Q. Liu, J.W. Zhai, Z.Z. Zhang et al., A survey on deep reinforcement learning. Chin. J. Comput. 41(1), 1–27 (2018)
W. Ahsan, W. Yi, Z. Qin et al., Resource allocation in uplink NOMA-IoT networks: a reinforcement-learning approach. IEEE Trans. Wirel. Commun. 20(8), 5083–5098 (2021)
V. Mnih, K. Kavukcuoglu, D. Silver et al., Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
C. He, Y. Hu, Y. Chen et al., Joint power allocation and channel assignment for NOMA with deep reinforcement learning. IEEE J. Sel. Areas Commun. 37(10), 2200–2210 (2019)
T. Schaul, J. Quan, I. Antonoglou et al., Prioritized experience replay, in Proceedings of the International Conference on Learning Representations (2015)
T.P. Lillicrap, J.J. Hunt, A. Pritzel et al., Continuous control with deep reinforcement learning, in ICLR (2015)
F. Meng, P. Chen, L. Wu et al., Power allocation in multi-user cellular networks: deep reinforcement learning approaches. IEEE Trans. Wirel. Commun. 19(10), 6255–6267 (2020)
I.H. Lee, H. Jung, User selection and power allocation for downlink NOMA systems with quality-based feedback in Rayleigh fading channels. IEEE Wirel. Commun. Lett. 9(11), 1924–1927 (2020)
J. Zhai, Q. Liu, Z. Zhang et al., Deep Q-learning with prioritized sampling, in Proceedings of International Conference on Neural Information Processing (2016), p. 13–22
A.R. Mahmood, H. Van Hasselt, R.S. Sutton, Weighted importance sampling for off-policy learning with linear function approximation, in Proceedings of the NIPS (2014), p. 3014–3022
X. Wang, Y. Zhang, R. Shen et al., DRL-based energy-efficient resource allocation frameworks for uplink NOMA systems. IEEE Internet Things J. 7(8), 7279–7294 (2020)
Y. Zhang, X. Wang, Y. Xu, Energy-efficient resource allocation in uplink NOMA systems with deep reinforcement learning, in Proceedings of International Conference on Wireless Communications and Signal Processing (WCSP) (2019), p. 1–6
X. Wang, R. Chen, Y. Xu et al., Low-complexity power allocation in NOMA systems with imperfect SIC for maximizing weighted sum-rate. IEEE Access 7, 94238–94253 (2019)
Acknowledgements
Not applicable.
Funding
This work was supported by the Basic Scientific Research Project of Heilongjiang Province [Grant Number 2020KYYWF1003].
Author information
Contributions
YL proposed the framework of the whole algorithm; ML performed the simulations, analysis and interpretation of the results. XF and ZL have participated in the conception and design of this research, and revised the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
The picture materials quoted in this article have no copyright requirements, and the source has been indicated.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
He, M., Li, Y., Wang, X. et al. NOMA resource allocation method in IoV based on prioritized DQN-DDPG network. EURASIP J. Adv. Signal Process. 2021, 120 (2021). https://doi.org/10.1186/s13634021008281