Energy-efficient access point clustering and power allocation in cell-free massive MIMO networks: a hierarchical deep reinforcement learning approach

Cell-free massive multiple-input multiple-output (CF-mMIMO) has attracted considerable attention due to its potential for delivering high data rates and energy efficiency (EE). In this paper, we investigate the resource allocation of downlink in CF-mMIMO systems. A hierarchical depth deterministic strategy gradient (H-DDPG) framework is proposed to jointly optimize the access point (AP) clustering and power allocation. The framework uses two-layer control networks operating on different timescales to enhance EE of downlinks in CF-mMIMO systems by cooperatively optimizing AP clustering and power allocation. In this framework, the high-level processing of system-level problems, namely AP clustering, enhances the wireless network configuration by utilizing DDPG on the large timescale while meeting the minimum spectral efficiency (SE) constraints for each user. The low layer solves the link-level sub-problem, that is, power allocation, and reduces interference between APs and improves transmission performance by utilizing DDPG on a small timescale while meeting the maximum transmit power constraint of each AP. Two corresponding DDPG agents are trained separately, allowing them to learn from the environment and gradually improve their policies to maximize the system EE. Numerical results validate the effectiveness of the proposed algorithm in term of its convergence speed, SE, and EE.


Introduction
With the rapid development in mobile communication technology, wireless communication systems are continuously moving toward higher data rates and energy efficiency (EE).In order to fulfill the high-rate, ultra-reliable, and low-latency communication requirements of future applications, massive multiple-input multiple-output (MIMO) systems has emerged as an effective paradigm to support the growing popularity of mobile applications, in which the base station implements hundreds of antennas to efficiently serve a large number of user equipments (UEs) [1].In practical real-world applications of massive MIMO systems, a substantial quantity of antennas can be deployed either centrally or in a distributed fashion.In the former scenario, all antennas are positioned within a confined space, eliminating the need for fronthaul.Leveraging channel hardening and favorable propagation characteristic, massive MIMO reduces the transmission energy consumption while mitigating interference among UEs, thereby supporting greater throughput and more connectivity [2][3][4][5].In the latter scenario, on the other hand, distinct antennas are geometrically spaced apart but linked to a central processing unit (CPU) through a fronthaul network.Given the proximity of UEs to antennas, cell-free massive MIMO (CF-mMIMO) can reduce interference, enhance system capacity, and more effectively utilize radio resources, compared with the central massive MIMO.Furthermore, CF-mMIMO technology fully leverages macro-diversity and multi-UE interference suppression, offering a nearly consistent quality of service (QoS) to UEs.This, in turn, enhances system performance and extends coverage [6][7][8].
For CF-mMIMO systems, access point (AP) clustering and power allocation play an important role in enhancing the system performance.Because proper AP clustering and power allocation strategies can effectively reduce multi-UE interference, improve the system EE, and ensure UE's QoS [9].In recent years, there have been many works focused on AP clustering and power allocation.For example, the work [10] introduced a joint optimization approach encompassing cell, channel, and power allocation to maximize the overall UE throughput.In [11], an iterative optimization was conducted for AP clustering, linear least mean square error precoding, and power allocation to maximize the sum-rate in CF-mMIMO systems.Additionally, the authors of the work [12] introduced distributed algorithms to address the joint problem of AP selection and power allocation in a multi-channel, multi-AP network.Furthermore, the authors of [13] devised a joint optimization approach for pilot allocation, AP selection, and power control, to enhance the throughput of CF-mMIMO systems.However, these algorithms' computational complexity increase exponentially with the increase of the number of AP and UE in the above-mentioned works, which results in real-time processing limitations that may be difficult to apply in CF-mMIMO systems.Therefore, it is urgent to develop a low-complexity and good-scalability method for large-scale networks.
On the other hand, to perform the AP clustering and power allocation, an alternative approach is to utilize the "learn to optimize" methodology.This approach leverages the capability of deep neural networks (DNNs) to acquire intricate patterns and approximate complex function mappings.Deep learning (DL) has excelled in perceptual capabilities but has certain limitations in decision making.In contrast, reinforcement learning (RL) possesses decision-making abilities but has constraints in perception.Therefore, the integration of DL and RL, known as deep reinforcement learning (DRL), aims to overcome each other's shortcomings [14].It enables the direct learning of control strategies from high-dimensional raw data.This approach closely resembles human thinking and has shown immense potential in enhancing wireless communication performance.DRL operates based on rewards, allowing it to find solutions to convex and non-convex problems without the need for extensive training datasets.Additionally, the computational complexity required for generating DRL outputs is low, involving only a small number of simple operations.These characteristics make DRL a powerful technology poised to make significant advancements in the field of wireless communication.
There are many works on resource allocation of CF-mMIMO systems based on DRL [15][16][17][18][19][20][21][22][23][24].In [15], a power allocation method based on deep Q-network (DQN) was examined, taking into account the presence of imperfect channel state information (CSI).The authors of [16] introduced a distributive dynamic power allocation scheme, employing model-free DRL, to maximize a utility function based on the weighted sum-rate.The work [17] investigated the joint pilot and data power control, as well as the design of receiving filter coefficients in cell-free systems.It also presented a decentralized solution utilizing the actor-critic (AC) approach.Moreover, in [18], feed-forward neural networks were integrated into the output layer of the deep deterministic policy gradient (DDPG) algorithm.This integration aimed to mitigate the issue of over-fitting in DRL.In [19], a power control algorithm, utilizing the twin delayed deep deterministic policy gradient (TD3), was introduced.Furthermore, to concurrently optimize compute and radio resources, [20] tackled a multi-objective problem through the application of a distributed RL-based method and a heuristic iterative algorithm.In [21], the authors introduced a framework for the joint channels clustering and power allocating using a multi-agent DRL approach, known as double DQN (DDQN).In [22], the research focused on optimizing relay clustering and power level allocation in device-to-device transmissions.To address this challenge, a centralized hierarchical DRL (HDRL) approach was introduced with the aim of discovering the most effective solution to the problem.[23] introduced a multi-agent RL (MARL) algorithm to tackle complex signal processing problems in high-dimensional scenarios.This approach incorporates predictive management and distributed optimization, all the while taking into account a dual-layer power control architecture that leverages large-scale fading coefficients between antennas to reduce interference.The work of [24] introduced HDRL to achieve automatic unbalanced learning.At a higher level, decisions regarding the quantity of synthetic samples to generate are made, while at a lower level, the determination of sample locations is influenced by the high-level decision, considering that the optimal locations may vary depending on the sample quantity.[25] explored DRL within a hierarchical framework and presented the hierarchical DQN (H-DQN) model.This approach breaks down the primary problem into autonomous sub-problems, with each one being handled by its dedicated RL agent.[26] demonstrated tasks that H-DQN cannot solve, highlighted limitations in such hierarchical frameworks, and described the recursive hierarchical framework that summarizes an architecture using recursive neural networks at the meta-level.
In a word, many research works demonstrated the excellent performance of DRLbased methods in resource allocation in CF-mMIMO systems.However, while DRL is excellent at solving the optimization problems, its limitations become apparent when dealing with some problems with hierarchy and complexity.This provides an opportunity for the introduction of HDRL.HDRL uses a hierarchical framework that breaks down problems into independent sub-problems and is handled by a dedicated DRL agent, allowing for better challenges when dealing with complex problems.Especially when the joint optimization problem of AP clustering and power allocation is conducted, HDRL can efficient handle multiple levels of information and decision making.This is mainly because HDRL can flexibly deal with hierarchical relationships, the original optimization problem is divided into more manageable sub-problems, so as to better adapt to the complexity and uncertainty in the actual scenario.Therefore, although DRL is excellent in solving most of optimization problems for some specific joint optimization problems, HDRL's hierarchical structure and flexibility can provide a more efficient solution.In this paper, how to make better use of the advantages of DRL and HDRL, and how to choose appropriate methods, is an important direction of resource allocation in CF-mMIMO systems.
More recently, it is important to emphasize that the AP-UE association network structure is tailored to the channel statistics to harness array gain, while power allocation is customized based on the instantaneous effective CSI to attain spatial multiplexing gain.Furthermore, AP clustering constitutes a global decision for the entire system, while power allocation involves local decisions for each specific link or connection after connection establishment.Therefore, two sub-problems need to be addressed at different timescales and levels since they have distinct scopes of influence and optimization objectives.Fortunately, the HDRL algorithm can decompose this joint optimization problem into two sub-problems, allowing optimization at different levels.Moreover, for CF-mMIMO systems, the total number of APs and UEs is very large and thus the joint optimization of AP clustering and power allocation would induce prohibitively high complexity.However, in practical applications, especially for mobile communication systems, the joint optimization needs to be completed in real-time.This requires the algorithm to have fast convergence speed and low latency to adapt to the network dynamics and user mobility.
To address the above challenges, we propose a hierarchical depth deterministic strategy gradient (H-DDPG) framework for joint optimization of AP clustering and power allocation in CF-mMIMO systems using the idea of "divide and conquer." The framework employs a two-layer control network operating on different timescales to improve the system EE performance.The main contributions of this paper can be summarized as follows.
• In this paper, we proposed a joint optimization framework for resource allocation of CF-mMIMO systems, taking into account AP clustering and power allocation.Unlike traditional methods, this integrated approach helps to coordinate system resources, reduce multi-user interference and power consumption, while meeting the constraints of the minimum spectral efficiency (SE) per UE and the transmitted power budget per AP, and thus achieve a better EE performance.• To solve the formulated joint optimization problem, we propose a H-DDPG algorithm.The algorithm combines the principles of DRL and hierarchical optimization, and creates a collaborative optimization framework that combines AP clustering and power allocation.By running two-layer control networks on different time scales, the decomposition and processing of system-level and link-level problems are realized, with the goal of maximizing the EE of the system.For system-level problems, namely AP clustering, the implement of DDPG on a large timescale enhances the configuration of wireless networks and helps to improve the overall performance of the network.For link-level problem, i.e., power allocation, by utilizing DDPG on a small timescale, interference between APs is successfully reduced and the transmission performance is improved.
• The efficiency of the proposed approach is verified through numerical simulations.
The simulation results demonstrate that compared to single-layer DDPG approaches, the proposed two-tier H-DDPG method, exhibits substantial enhancement in terms of the convergence speed, EE, and SE performance.
The following sections of this paper are organized as follows: Section 2 provides a detailed description of the system model.In Section 3, the power consumption model is elaborated.Section 4 is dedicated to formulating the optimization problem.Section 5 expounds on the principles and design of the H-DDPG algorithm.Section 6 discusses the simulation results.Finally, Section 7 offers the paper's conclusion.

System model
As illustrated in Fig. 1, we consider a CF-mMIMO system, comprising K arbitrarily located single-antenna UEs and L geographically distributed APs, with each AP being equipped with N antennas.In the context of implementing the UE-centric cell-free architecture, where each UE is catered to by a specific subset of APs.All APs are connected to the CPU via fronthaul links.
In this paper, time division duplex (TDD) protocol and a block fading model are adopt, where time-frequency resources are divided into coherent blocks so that the channel coefficients in each block can be assumed to be fixed.The coherent block consists of τ c symbols, as shown in Fig. 2, and the channels are independently and randomly distributed in each coherent block.In the TDD-based protocol, each coherent interval τ c is divided into three stages: (i) the uplink channel estimation phase, (ii) the downlink data transmission phase, and (iii) the uplink data transmission phase.In this paper, we only focus on resource allocation to maximize the system EE of the downlink in CF-mMIMO systems.The proposed algorithms in this work can be readily applied to the uplink scenario, after some straightforward simplification.Assuming the duration of uplink training phase is τ p , the remaining coherence interval (τ c − τ p ) is used for the downlink data transmission.
In the context of spatially correlated Rayleigh fading, the channel linking the k-th UE and the l-th AP is denoted as h kl ∈ C N and is characterized as signifying the spatial correlation matrix.Conversely, when transmitting from the l-th AP to the k-th UE, the average channel gain for any specific antenna is determined through the normalized trace β kl = 1 N tr(R kl )

Channel estimation
In the channel estimation phase, each AP independently obtains the CSI through local estimation, using the uplink pilot transmissions from UEs.Consider a collection of mutually orthogonal τ p pilot sequences allocated to UEs, where t k denotes the index of the pilot assigned to the k-th UE, and S t k represents the group of UEs sharing pilot t k .When a UE from set S t k transmits pilot t k , the received signal y p t k l ∈ C N at the l-th AP is determined by computing the inner product between the received signal and pilot sequence t k as follows: where p p represents the pilot transmitting power per UE, and n t k l ∼ N C 0, σ 2 I N denotes the additive Gaussian noise vector at the l-th AP.By utilizing the MMSE estimator at the l-th AP, the channel estimation between the k-th UE and the l-th AP can be expressed as where signal's correlation matrix.

Data transmission
Let ς i ∈ C symbolize the downlink data signal with unit power for the i-th UE, where For the l-th AP, the CPU encodes the associ- ated data symbol and transmits it to the l-th AP over fronthaul links.Then, the transmitted signal at the l-th AP is given as (1) where w il ∈ C N represents the normalized precoding vector at the l-th AP for the i-th UE, i.e., �w il � 2 = 1 , and ρ il ≥ 0 signifies the power allocated to the i-th UE by the l-th AP.
Suppose that each UE is served by all APs over the same time-frequency resources.Accordingly, the received signal at the k-th UE can be expressed as where n k ∼ N C 0, σ 2 represents the noise at the k-th UE.
As stated in [27], the achievable SE of the k-th UE is defined as where This SE expression is applicable to any choice of the transmission precoding vector w k .In fact, it also applies to any channel allocation, not just the associated Rayleigh fading.The expression can be calculated using Monte Carlo methods, which involves approximating each expectation with a substantial number of randomly generated sample averages.More precisely, we can generate an implementation of the channel estimate in a large set of coherent blocks.The normalized precoded vector is denoted as w il = w il /�w il �.This paper adopts the linear precoding minimum mean square error precoding scheme.

Power consumption model and energy efficiency
This section presents the definition of a general power consumption model.The main elements of the power consumption model are given as follows: (1) the power consumption at the radio site, encompassing power utilization by UEs P ue k : ∀k APs P ap l : ∀l , and fronthaul links P fh l : ∀l (2) the CPU power consumption P cpu [28].The total power consumption is represented as The power consumption at the k-th UE is given as (4) is the power consumption of the internal circuitry, while the second term accounts for the power usage during uplink transmission, and 0 < η ue ≤ 1 denotes the power amplifier efficiency at the UE.τ p τ c is represented as the fractions of the uplink pilot transmission, the power consumption at the l-th AP is wherein the internal circuit power of each AP antenna is represented by P c,ap l D l ⊂ {1, . . ., K } is the subset of UEs served by the l-th AP, P pro l as the power consumed to process the received/transmitted signals of each UE in D l , and 0 < η ap ≤ 1 as the power amplifier efficiency at the AP.For each fronthaul link, the power consumption is where P fix l represents the fixed power consumption, and the remainder describes the load-related uplink and downlink signaling, with P sig l representing the signaling power for each UE.The CPU is charge of processing all of the UE's signals and has power consumption where P fix cpu represents the fixed power consumption, B stands for the system bandwidth, and P cod cpu denotes the energy consumption per bit for the initial encoding at the CPU.As per the established power consumption model, the overall system EE in the considered CF-mMIMO system (in bits/joules) is

Problem formulation
The focus of this paper is on maximizing the overall system EE in the considered CF-mMIMO system through a joint optimization of AP clustering and power allocation, while adhering to constraints associated with the SE requirements of individual UEs and the transmit power budgets of each AP [29].The problem is formulated as (8) where SE min k represents the minimum spectral efficiency required for the k-th UE, and P max represents the maximum transmit power of each AP.
It is evident that the optimization problem in ( 13) is non-convex, making it challenging to obtain a global optimal solution that adheres to the non-convex constraint C1 and C2.Furthermore, in CF-mMIMO systems, the number of APs and UEs is very large, leading to a high complexity when we jointly optimize AP clustering and power allocation.To address these challenges, in the following section, the HDRL algorithm is developed to obtain a optimal solution.

Joint optimization of AP clustering and power allocation based on H-DDPG algorithm
This section introduces a two-layer DRL framework for addressing the optimization problem ( 13) by employing the HDRL framework.This framework incorporates both high-level and low-level control strategies to achieve the network optimization through hierarchical coordination.

HDRL framework
Since the joint optimization of AP clustering and power allocation usually involves a large number of nonlinear constraints.Traditional optimization methods cannot effectively solve nonlinear problems, and non-convex optimization algorithms result in cubic-order complexity, which is impractical in CF-mMIMO systems.The DRL algorithm has some advantages in dealing with complex nonlinear problems.Moreover, HDRL is a DRL framework composed of two distinct levels of DRL.This hierarchical approach aims to address complex decision tasks by partitioning them into high-level and low-level control layers.The advantages of HDRL compared to DRL are as follows.
1. HDRL uses a hierarchy to break down complex tasks into smaller sub-tasks with their own policies and rewards, improving efficiency in solving complex problems.2. HDRL promotes knowledge sharing between different levels, allowing low-level strategies to benefit from higher-level guidance, accelerating effective strategy learning.

HDRL balances exploration and utilization by involving different levels of strategy.
High-level policies focus on global decision making, while low-level policies provide detailed exploration and adjustment, enhancing environmental exploration and strategic refinement.4. Because HDRL can learn strategies more efficiently, especially in the case of reasonably divided hierarchies, it usually has faster convergence.This means it can find a better strategy in a relatively short period of time.
Therefore, this paper proposes a distributed two-layer DRL framework, as shown in Fig. 3, in which the high-level control strategy and the low-level control strategy work successively in a distributed manner.In each layer of HDRL, agents learn their dynamic environment through repeated sequences of observations, actions, and rewards.In slot t, by observing state s t , the agent takes action a t ∈ A based on some strategy π, and then receives a reward r t from the environment and advances to the following state s t+1 .The quadruple (s t , a t , r t , s t+1 )represents a single interaction with the environment [30].Define e t as an empirical sequence, where e t = (s t , a t , r t , s t+1 ) .To maximize cumulative rewards, the agent seeks an optimal strategy.
The two primary methods for determining the optimal strategy are the value-based approach, DQN, and the policy-based approach, DDPG, respectively.Significant spatial capacity is necessary to accommodate the potentially substantial quantity of UEs and APs within the CF-mMIMO system, and continuous motion space due to power allocation issues.Traditional RL methods, such as Q-learning and DQN, are commonly used to deal with discrete action spaces where agents choose one of a limited set of discrete actions.DDPG and its enhanced iterations are well-suited for addressing challenges in scenarios with continuous action spaces.It combines deterministic strategy gradient method and DNN, which can effectively learn and optimize continuous action strategies.Hence, the DDPG algorithm can obtain rapid and consistent learning rate.
The DDPG algorithm consists of four networks: the actor network, the critic network, and their respective target networks, each with specific relationships between them.The role of the actor component is to produce an action for each time step using a deterministic strategy denoted as µ(s t |θ µ ) , learned by a DNN with weight .Update the weight θ µ to find the best deterministic strategy µ(s t |θ µ ) using the action value function, with the anticipated long-term reward defined as: Here, R t represents the cumulative discounted future reward.In general, θ µ is updated by gradient ascent: Here, α µ represents the learning rate, and J (θ ) stands for ( 14) Fig. 3 HDRL algorithm framework J (θ) represents the anticipated target value of s t .It is worth noting that the action value function in Eq. ( 14) permits recursive relationships This is what is required to determine Q(s t , µ(s t |θ µ )) given by (15) in the critic section.More specifically, the critic uses a separate DNN with weight θ Q to evaluate the action value function Q s t , µ(s t |θ µ )|θ Q .Typically, weight θ Q is updated using the following method: Here, α Q represents the learning rate, and L(θ ) stands for where L(θ) represents the mean squared Bellman error function of the target.
The target actor network is essentially a duplicate of the actor network, but it undergoes gradual parameter updates aimed at stabilizing the training process.The primary role of the target actor network is to supply target actions, thereby minimizing the training error of the actor network.The target critic network is a duplicate of the critic network, and it also undergoes gradual updates.Its function is to provide target value function to reduce the training error of critic network.The target critic network's output is utilized to compute the critic network's error, facilitating the back-propagation of the error signal and, consequently, updating the critic network.In the course of training, the parameters of the target actor network and target critic network are methodically aligned with those of the corresponding actor network and critic network at specific intervals.This process aids in stabilizing the training, minimizing fluctuations, and enhancing the algorithm's convergence.The structure of the DDPG algorithm is illustrated in Fig. 4.

Action dichotomy scheme based on DDPG
The advantage of splitting DDPG actions into AP clustering and power allocation actions is that it allows the agent to consider these two critical factors simultaneously in a single time step and optimize system performance in a more integrated manner.This design helps to reduce communication latency, increase network capacity, reduce interference levels, and provide a better user experience [31].

State s t
obtained from the low-level AP clustering,The state space comprises the AP cluster AP t−1 selected by the AP in the previous time slot, the channel's large-scale fading coef- ficient β lk , the transmit power p t−1 from the previous slot, and the UE's SINR.Defined as follows is the state of the t-th time slot: The agent's action is divided into AP clustering and power allocation; thus, the action of the t-th time slot is defined as follows: 3. Reward r(s t , a t ).
Defining (22) is the reward function of the t-th time slot according to the goal in the problem.
where u(x) is a step function, with a value of zero for x ≤ 0 and one otherwise.The steps of the DDPG algorithm closely resemble those of the H-DDPG algorithm presented below, and will not be introduced here.(

H-DDPG-based method
Considering the complexity of AP clustering and power allocation, a HDRL model with DDPG is proposed to reduce the search space.The layered model consists of two DRL agents and is optimized with power allocation as the meta-controller.The power allocation agent PA s PA t , a PA t , r PA t is responsible for assigning transmission power to the UE with the objective of optimizing the signal quality, such as the signal-to-noise ratio, at the UE terminal, and enhancing the overall system throughput, all while minimizing interference.The AP clusters the agent AP s AP t , a AP t , r AP t to select the service AP for all UEs and makes it meet the constraint of minimum SE.Typically, the AP clustering agent depends on the power allocation agent for transmitting packets, and the reward for the AP clustering agent actions depends on the actions of the power allocation agent.Moreover, the input s PA t for the power allocation agent depends on the actions of the AP clustering agent and the environmental state.
Specifically, two DDPG agents are designed, corresponding to AP clustering and power allocation, respectively.In the first layer DDPG, a state space is established, which includes CSI of different base stations, UE equipment location and channel quality.Based on this state information, a strategy network is trained to cluster the appropriate set of APs to maximize the data transfer rate.Further considered in the second layer of DDPG is the power allocation problem.Using the current network state and the AP set clustered in the first layer of DDPG, the second policy network is trained to optimize the power allocation for each AP, aiming to achieve maximum EE while considering power consumption.They evolve their strategies through constant interaction.Verified by a large number of numerical simulation experiments are the effectiveness and performance advantages of the proposed method.The definitions of state, action, and reward functions related to the H-DDPG algorithm are as follows.

• AP Clustering
Because AP clustering is a global decision for the entire system, and power allocation is a local decision for each specific link or connection after the connection is established.These two problems need to be solved at different timescales and levels because they have different spheres of influence and optimization goals.Therefore, AP is clustered as high-level DDPG, and its DDPG algorithm is designed as follows: The state space S is a collection of states observed by the agent within the environ- ment.It includes the AP cluster AP t−1 from the previous time slot, the large-scale fading coefficient β lk of the channel, and the SINR of the UE.Defined as follows is the state of the t-th time slot:

Action a t
The action space A is a collection of actions undertaken by the agent in response to each state.In the context of the AP clustering problem, an assignment variable α lk ∈ {0, 1} is defined, where k = 1, . . ., K , l = 1, . . .L .When the k-th UE is served by the l-th AP, it is equal to 1; otherwise, α lk = 0 .The joint clustering of all α lk forms an action space composed of discrete points of 2 kl .The action for the t-th time slot is defined as follows: 3) Reward r(s t , a t ).According to the goal in the problem, the reward function of the t-th time slot is defined as: Through the step function, a uniform penalty is created for all UEs who do not achieve the minimum SE [32].
The steps of high-level DDPG algorithm are shown in Algorithm 1.Initially, the actor network and critic network, along with their respective target networks, are initialized.Simultaneously, the experience replay buffer and noise are also initialized.Using the actor network, an action is selected based on the current state, if a H t is greater than zero, then consider action a H t = 1 , otherwise set a H t = 0 .Execute the selected action in the environment, then record the reward and the next state.Storing the current state, action, reward, and the next state within the experience replay buffer.Randomly selecting a batch of experiences from the replay buffer for training the actor and critic networks.The target value is computed with the assistance of the target actor network and target critic network, aiming to minimize the error of the critic network.By applying the target actor network to the next state, the target value is estimated.To minimize the critic network error, the critic network error is calculated, and the gradient descent method is employed to update the critic network weights.The policy gradient method updates the weights of the actor network by maximizing the value function of the critic network associated with the actor network's output.Soft updates are employed to enhance training stability by gradually adjusting the parameters of the target actor and critic networks to align with those of the actor and critic networks.Repeat the training cycle until the predetermined number of training cycles is reached.

Power allocation
As the low level of the H-DDPG algorithm, power allocation is taken, and the DDPG algorithm is designed as follows. (24) Consisting of the AP clusters AP t obtained from the low-level AP clustering, the large-scale fading coefficient β lk of the channel, the transmit power p t−1 of the.

Algorithm 1
The main step of the high-level DDPG algorithm previous time slot, and the SINR of the UE, the state space is defined.Defined as follows is the state of the t-th time slot:

Action a t
The action of the agent is defined as the transmit power assigned by the AP to the UE.The action of the t-th time slot is defined as follows: (26) In accordance with the problem's objective, the reward function for the t-th time.

Algorithm 2
The main steps of low-level DDPG algorithm slot is defined as follows: It is important to emphasize that although the high-level and low-level rewards are expressed differently, they both constitute integral components of their respective global rewards.Furthermore, the constraints on high-level and low-level strategies are identical, (27) ensuring a monotonically improving EE throughout the system.The procedure of low-level DDPG algorithm is shown in Algorithm 2.
The most significant distinction between Algorithm 2 and Algorithm 1 lies in the.
Algorithm 3 he main steps of H-DDPG algorithm action design.Algorithm 2 handles action a L t to adhere to the maximum transmit power constraint of each AP.The idea of H-DDPG is to update the high-level and lowlevel networks sequentially, where the agents continuously adjust the high-level and low-level decisions to improve the transmission performance of the whole system.In particular, the high-level network policy is adjusted using a stable low-level policy to inter-AP interference, whereas the low-level policy is updated with a more convergent high-level network policy to enhance system performance.The two networks continually update each other to attain their respective optimal strategies [33].
In this paper, we focus on the convergence and cumulative returns of the proposed framework H-DDPG in different training periods.During each training session, the two agents utilize the policy with the maximum reward to select the AP and assign power to each UE in the network, and then update their policies at the end of each training session.The complete H-DDPG algorithm steps are summarized in Algorithm 3.

Computational complexity
H-DDPG consists of two layers of DDPG, so its complexity is the sum of the superposition of two layers.In the DDPG layer for AP clustering, the input dimension is L(2K − 1) + K , and the output dimension is LK.Hence, the computational complexity of both the actor networks and their corresponding target networks is represented as , and the complexity of the critic networks and their target net- works is denoted as . The over- all complexity is represented as O(ap) = O(a1) + O(c1) .Moreover, in the DDPG layer for power allocation, the input dimension is L(3K − 1) + K and the output dimension is LK, the complexity of the actor networks and their target networks is symbolized as Similarly, the complexity of the critic networks and their target networks is expressed as

Numerical results
In this section, the implementation details of the HDRL method for solving the optimization problem given by (13) are presented, followed by evaluating its performance and comparing it with other existing methods.

Simulation parameters
The network topology comprises L = 12 APs and K = 8 UEs randomly distributed within a 0.5 km radius.The number of antennas per AP is denoted as N = 4 .In the case of each randomly generated topology as described above, the positions of UEs and APs remain constant during the evaluation phase.A random pilot assignment is employed, in which each UE randomly selects a pilot sequence from an orthogonal pilot pool with a length of τ p .Considering the communicate operates on a 20 MHz channel.The total receiver noise power is -94 dBm.Each AP has the maximum downlink transmit power of p max = W , while each UE has an uplink power of p i = 100 mW during the pilot transmit phase.The coherent block length is set to [34][35][36].The path loss model used to calculate the LSF coefficient is defined as follows: where d kl represents the distance between the k-th UE and the l-th AP.Table 1 summa- rizes the simulation parameters and represents the urban micro-area environment.In this algorithm, the high and low layer networks have the same structure.The actor neural network consists of a fully connected layer of input layer and average dimension hidden layer, and it is connected to a second fully connected hidden layer of output action space dimension by batch normalization and layer normalization.The number of hidden layer neurons is the average over the input dimension and output dimension, respectively.Moreover, critic neural network is divided into two parts: (29)  state and action.The state part includes the input layer, the full connection layer of the average dimension hidden layer, which is processed by batch normalization and layer normalization.The action part connects the status part output and the action input with the full connection layer.The number of hidden layer neurons is twice that of the average input dimension.The critic network finally outputs Q-value through the output layer.ReLU is chosen as the activation function for the high-level aspect, while the Sigmoid function is selected for the low-level aspect.
The learning rate for training the actor network is set to δ = 1e − 4 , the learning rate for training the critic network is δ = 1e − 3 , the discount factor is 0.99, and the soft update parameter, which governs the gradual update of target network parameters, is 0.001 [37].The target network is created as a duplicate of both the actor network and the critic network to ensure training stability.The neural network was trained using minibatch stochastic gradient descent (mini-batch SGD) with a mini-batch size of 128, and an experience replay buffer with a maximum capacity of 10,000.It is assumed that all UEs have the same SE min k , as CF-MIMO systems offer the advantage of delivering a consistently satisfactory service to all UEs in the network.

Simulation results discussion
To evaluate the performance of the proposed H-DDPG-based methods, three other algorithms were used for comparison.One approach is to employ a single-layer DDPG algorithm to address the challenge of jointly optimizing AP clustering and power allocation.The single-layer DDPG framework generates an action corresponding to AP clustering and power allocation at each time slot, which is represented as "S-DDPG." The second approach involves applying the DDPG algorithm to AP clustering, while power allocation is addressed using a conventional method, referred to as "AP-DDPG".The third is the traditional joint optimization scheme [38], which is denoted as "LP-MMSE." Compared to H-DDPG, S-DDPG is used to compare the advantages of hierarchical structure and provides an alternative method to solve the joint optimization problem.Moreover, the AP-DDPG scheme is chosen to independently study the effects of AP clustering and power allocation, which facilitates to understand the advantages of DRL in specific problems and the applicability of traditional methods for large-scale problems.As a traditional optimization scheme, LP-MMSE provides a benchmark against DRL-based methods.For S-DDPG, the total complexity of the actor network and its corresponding target network is denoted as O(a) = (5LK − L + K ) 2 .Similarly, the complexity of the critic network and its associated target network is represented as Since SE and EE affect each other, when SE is increased, EE may decrease.It is evident from Figs. 5 and 6 that when subjected to similar SE constraints, the CF-mMIMO system employing H-DDPG attains the highest EE.It improves EE by nearly 35% and 19% compared to S-DDPG and AP-DDPG, respectively.Compared with traditional optimization schemes, the EE performance gains are more significant.Hence, while the computational complexity of the S-DDPG and AP-DDPG algorithms is lower compared to the H-DDPG algorithm, but it is evident that the optimization performance of the H-DDPG algorithm is significantly superior to both of these algorithms.This is attributed to the fact that H-DDPG decomposes the problem into two levels: a high level, which addresses AP clustering, and a low level, focusing on power allocation, with each level having its dedicated policy network.This layered approach makes the optimizing decisions at different levels more flexible and efficient.In addition, the hierarchical approach helps the policy network reach convergence more easily because the action space and state space at each stage are relatively small.Simultaneously, the high-level and low-level policies can be optimized to enhance system-level and link-level performance, thereby elevating the overall system's performance.Moreover, it is important to note that the curve performs poorly at the beginning, but converge rapidly as the As depicted in Fig. 7, it is evident that within the H-DDPG algorithm, the number of UEs meeting the minimum SE requirement gradually rises with the advancement of iterations.Furthermore, it is evident that even when the minimum SE threshold is set at a high value and the EE does not exhibit significant differences, the majority of UEs can still satisfy this threshold.This highlights the advantages of H-DDPG algorithms in the tradeoff of SE and EE, which can flexibly adjust system performance under different requirements to achieve better performance optimization.
The hierarchical structure enables H-DDPG to dynamically adjust its policy to adapt to changes in different environments and network states.This flexibility can be important in real networks, especially when network conditions are constantly changing, and H-DDPG can respond to those changes, thus achieves an improved EE performance.It adjusts the power allocation strategy according to the current network state and UE needs and gradually improves the SE of each UE.Moreover, the algorithm efficiently controls the transmission power of each AP to guarantee system stability and feasibility, consequently mitigating potential interference issues.H-DDPG algorithms consider power allocation collaboratively on a global scale to maximize total EE, rather than just local optimization, which helps to make more efficient use of limited resources.
Figure 8 illustrates the cumulative distribution function of EE per UE for various methods.From Fig. 8, it is evident that the H-DDPG method exhibits the outstanding EE performance, the AP-DDPG method shows a slightly inferior performance, and the S-DDPG method demonstrates the poorest EE performance.This is mainly because the H-DDPG method makes full use of the knowledge sharing and reuse mechanism brought by the hierarchical structure, so that it can make more accurate decisions.In the hierarchical structure, agents at different levels can share information, and Fig. 7 Average number of UEs whose SE exceeds the threshold the policies at lower levels can be guided by the policies at higher levels, thus the knowledge of the whole system can be transferred between different levels, effective policies can be learned faster.This provides a more comprehensive understanding of the entire system state, enabling more precise optimization.Therefore, H-DDPG can learn strategies more efficiently, especially in the case of a reasonably divided hierarchy, and it usually has faster convergence.
As shown in Fig. 9, we can observe the effect of the relative position between the AP and the UE on the system EE over time.It is worth noting that the H-DDPG method consistently performs well in terms of the EE performance, consistently outperforming other methods.In comparison, AP-DDPG method, although slightly inferior to  H-DDPG, still ranks second and shows the good performance.In addition, S-DDPG method is inferior to H-DDPG and AP-DDPG in the EE performance, it is still superior to the traditional algorithm.
These observations highlight the effectiveness of the H-DDPG method.The reason for this difference is that H-DDPG breaks down the joint optimization problem into multiple levels, each focused on a different task or goal.This hierarchical control strategy enables agents to deal with multi-objective problems more effectively, and decomposes complex tasks into simpler subtasks, thus improving the manageability and execution efficiency of tasks.Moreover, H-DDPG can balance the capacity of exploration and utilization.Each level of strategy can play a role in different stages of exploration and utilization.High-level policies are generally more focused on the global exploration and general decision making, while low-level policies can be explored and adjusted in more detail.This facilitates to explore the environment during the learning process, while allowing for more refined strategic adjustments.

Conclusion
In this paper, we designed an energy-efficient AP clustering and power allocation for CF-mMIMO systems.To deal with the non-convex optimization problem, a H-DDPG algorithm is proposed.This algorithm takes the advantage of both DRL and hierarchical optimization to create a collaborative optimization framework which effectively handle AP clustering and power allocation.In particular, the original optimization problem first was divided into two sub-problems, and then the corresponding DDPG agents are trained separately.During the training phase, these agents learn from the environment, gradually improving their strategies to maximize the system EE.Simulation results clearly demonstrate that the proposed H-DDPG method leads to a substantial enhancement in the system's EE, compared to benchmark methods.Moreover, this study provides a new research perspective for resource allocation in CF-mMIMO systems, that is, by dividing the problem into small sub-problems in multiple stages, the problems at different levels can be optimized more flexibly.For the further research, in light of the centralized information exchange in single-agent DRL-based algorithms, it is essential to develop multi-agent DRL algorithms for resource allocation in CF-mMIMO systems.Also, the joint design of AP clustering and UE scheduling, as well as beamforming is promising to satisfy future massive connectivity in real-world applications.

Page 2 of 25
Tan et al.EURASIP Journal on Advances in Signal Processing (2024) 2024:18

Fig. 8
Fig. 8 CDF of the EE per UE under different schemes with L = 12 and K = 8

Table 1
Environmental parameters