Volume 2024, Issue 1 1809850
Research Article
Open Access

Prioritized Experience Replay–Based Path Planning Algorithm for Multiple UAVs

Chongde Ren
School of Software, Northwestern Polytechnical University, Xi'an 710072, China, nwpu.edu.cn

Jinchao Chen (Corresponding Author)
School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China, nwpu.edu.cn

Chenglie Du
School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China, nwpu.edu.cn
First published: 10 August 2024
Academic Editor: Pengyun Chen

Abstract

Unmanned aerial vehicles (UAVs) have been extensively researched and deployed in both military and civilian applications due to their small size, low cost, and ease of deployment. Although UAVs working together on complicated tasks can significantly increase productivity and reduce costs, their cooperation raises major path planning challenges. In complex environments, path planning is a multiconstraint combinatorial optimization problem that is hard to solve: it must account for numerous constraints and limitations while generating the best path for each UAV to accomplish the group task. In this paper, we study the path planning problem for multiple UAVs and propose a reinforcement learning algorithm, PERDE-MADDPG, based on prioritized experience replay (PER) and delayed update skills. First, we adopt a PER mechanism based on the temporal difference (TD) error to enhance the efficiency of experience utilization and accelerate the convergence of the algorithm. Second, we use delayed updates when updating network parameters to keep the training of multiple agents stable. Finally, we evaluate the resulting PERDE-MADDPG algorithm against the MATD3, MADDPG, and SAC methods in simulation scenarios to confirm its efficacy.

1. Introduction

Currently, with the continuous development of aviation technology and artificial intelligence [1], unmanned aerial vehicles (UAVs) have been widely used in target reconnaissance [2], communication relay [3], geographic surveying and mapping [4], and logistics and distribution [5]. However, as application environments become more complex, a traditional single UAV adapts poorly to the environment and achieves low mission success rates on complex operational tasks [6]. Fortunately, the collaborative work of multiple UAVs effectively mitigates many drawbacks of a single UAV in complex environments and significantly improves the success rate of mission execution. Multi-UAV cooperative work offers low cost, good scalability, and strong adaptability, and it has become the mainstream direction of UAV application [7].

The basis of multi-UAV cooperative work is the multi-UAV path planning problem [8], which generates an optimal path for each UAV to reach its destination, taking into account various constraints, such that the total planned path length is minimized [9]. Especially in dynamic environments with partial or no prior knowledge, collaborative path planning of multiple UAVs requires real-time adjustment of the planned paths based on environmental changes to avoid collisions with other flying objects, which is a highly complex and challenging problem. Path planning for multiple UAVs is essentially a multiconstraint combinatorial optimization problem [10], and the nature of combinatorial optimization makes it strongly NP-hard [11]. Moreover, the presence of multiple environmental constraints and the varying performance capabilities of the UAVs contribute to the increased complexity of achieving optimal solutions.

The performance limitations of a single UAV must be taken into account during the actual path planning process, in addition to objective conditions such as the mission itself and its environment [12]. As demonstrated in Figure 1, path planning for UAVs entails obtaining the group mission specifications and quickly computing a workable path for every UAV. This process takes full account of the significant influencing factors, both static and dynamic. The static factors encompass UAV capabilities, region scope, and fixed obstacles, while the dynamic factors involve cooperative relationships, communication networks, and moving obstacles. Cooperative path planning aims to efficiently generate feasible paths for each UAV while ensuring that the UAVs can efficiently accomplish the cluster mission.

Figure 1: Limitations in multi-UAV path planning.

With the extensive application of large-scale UAV collaborative work, multi-UAV path planning has gained widespread attention and has become a hot topic in current UAV research. In recent years, researchers from diverse backgrounds have investigated multi-UAV path planning problems with various restrictions and objectives. However, due to the multitude of constraints, the immense computational requirements, and the complexity of UAV motion models, most previous studies have simplified the UAV constraints and applied their methods to relatively simple static environments; as a result, they cannot generate optimal solutions for path planning in complex real-time environments. Fortunately, thanks to the rapid advancement of artificial intelligence, deep reinforcement learning techniques based on trial-and-error learning have been widely applied to multi-UAV path planning problems; they adapt to dynamically changing environments and generate optimal paths for each UAV. Compared to traditional methods, deep reinforcement learning approaches have significant advantages in complex environments, demonstrating stronger robustness and enabling true collective intelligence [13].

However, despite the potential of deep reinforcement learning algorithms in addressing multi-UAV path planning problems, they face several challenges. One of these challenges is how to effectively utilize valuable experiences [14]. During the training process, deep reinforcement learning algorithms need to interact with the environment and extract useful knowledge from experiences. However, due to the complexity and diversity of path planning problems, effectively leveraging experiences becomes quite difficult. Furthermore, deep reinforcement learning algorithms are prone to instability during the training process [15]. This means that the algorithm may experience fluctuations and inconsistencies, leading to slower convergence speeds.

In this study, we propose a prioritized experience replay (PER) mechanism based on the temporal difference (TD) error to address the issue of effectively utilizing experience. In addition, we propose a delayed update skill to address the issue of unstable and divergent updates during the agents’ training process. Combining these two points, we propose the PERDE-MADDPG algorithm for the multi-UAV path planning problem. The significant contributions of this study are summarized as follows:
  • 1.

    In order to utilize the more valuable experiences efficiently, we employ a PER mechanism instead of the traditional experience replay mechanism, which accelerates the convergence of the algorithm.

  • 2.

    We adopt a delayed update strategy to address the issue of unstable updates during the agents’ training process. Specifically, after a certain number of updates to the critic network, we update the policy network, target critic network, and target policy network once, which stabilizes the training process and mitigates the instability associated with frequent updates.

The remainder of this work is organized as follows. Section 2 reviews related work of path planning. In Section 3, the cooperative path planning of multiple UAVs is abstracted as a multiconstraint combinatorial optimization problem, and its Markov model is established. Section 4 presents the PERDE-MADDPG algorithm. In Section 5, simulation experiments are carried out, and the proposed algorithm’s efficacy is confirmed. Section 6 summarizes the work of this paper and future research directions.

2. Related Work

Path planning aims to design the optimal path for each drone to follow during its flight [16] and is a key reason why multiple UAVs can be successfully deployed in different domains [17]. Path planning algorithms can generally be categorized into three main groups: sampling-based methods, heuristic algorithms, and deep reinforcement learning algorithms.

The Probabilistic Roadmap Method (PRM) [18] and the Rapidly-exploring Random Tree (RRT) [19] are the main sampling-based methods. Sampling methods avoid modeling the entire environmental space and instead reconstruct the environment from sampled points, resulting in relatively low computational cost. For instance, Yan, Zhuang, and Xiao [20] proposed a real-time path planning technique for UAVs in intricate 3D environments, introducing a modified PRM that samples randomly within a bounding box array. Song et al. [21] proposed an improved RRT path planning algorithm for automated land vehicles (ALVs), which combines the nonholonomic constraints of the vehicle with a double-extended RRT to improve search efficiency while ensuring path feasibility. Although sampling methods can handle path planning problems in high-dimensional environments, they rely on random sampling within the problem space, so the quality of the obtained paths cannot be guaranteed and the outcomes are frequently less than ideal.

Heuristic-based methods include the genetic algorithm (GA) [22], particle swarm optimization (PSO) [23], the ant colony optimization (ACO) algorithm [24], and others. Most heuristic algorithms search for spatially optimal solutions by mimicking the behavior of group organisms, such as foraging or prey roundup, and they can solve high-dimensional, multiconstraint optimization problems. To address the issue of the traditional PSO algorithm easily falling into local optima, Huang et al. [25] proposed a novel PSO algorithm called ACVDEPSO, in which particle velocities are transformed into cylinder vectors for the convenience of path searching. To overcome the low search efficiency of the traditional ACO algorithm on path planning problems, Li, Xiong, and She [26] introduced variable pheromone enhancement factors and variable pheromone evaporation coefficients into the ACO algorithm, proposing a new algorithm called ACO-VP. Despite the capability of heuristic algorithms to find approximately optimal solutions in large-scale problems and to explore complex search spaces effectively, they often suffer from slow convergence, complex operations, long computation times, and sensitivity to parameters.

Sampling methods and heuristic algorithms mostly treat UAV cluster task allocation and path planning separately, ignoring the coupling between the two. Deep reinforcement learning algorithms are more task-oriented: guided by reward functions, they accomplish task allocation, path planning, collision avoidance, and obstacle avoidance while satisfying various constraints. To address the slow convergence caused by the low reward of training samples, Jun, Yunxiao, and Li [27] combined hindsight experience replay (HER) with DQN to increase the sample validity of the experience pool, thereby improving convergence speed. Han et al. [28] employed the artificial potential field approach to influence the action selection of the DDPG algorithm, which improved path smoothness and shortened path length. The benefits of using reinforcement learning algorithms to solve path planning problems include their capacity to handle complex environments and their adaptability and generalization abilities.

However, despite the various advantages of deep reinforcement learning algorithms in solving multi-UAV path planning problems, they still face some problems. Two significant challenges are how to efficiently utilize experience, as well as the instability and slow convergence speed during the training process. To address the challenge of effectively utilizing experience, we propose a PER mechanism based on TD-error. This mechanism assigns priority based on the TD-error of experience, reflecting the importance of each experience in learning and decision-making. In addition, we address the issues of instability and slow convergence during the training process by introducing a delayed update strategy. By using this delayed update strategy, the frequency of parameter updates during the training process can be reduced, thereby increasing the stability of the training.

3. Problem Formulation and System Model

In this section, we define and model the path planning problem of multiple UAVs. Firstly, the path planning problem is described as an optimization problem subject to multiple constraints. Then, in order to satisfy the optimization conditions, we construct a multiagent system based on the MADDPG framework.

3.1. Formulation of the Multi-UAV Path Planning Problem

In our work, a fleet of N homogeneous UAVs navigates from their respective starting points to predefined destinations while communicating and coordinating in the air, and the UAVs need to navigate around M obstacles. Like most research studies, we assume that all UAVs fly at the same altitude, focusing primarily on horizontal control and ignoring lift control. We use Pi(t) = (xi(t), yi(t)), i ∈ [1, N], to represent the position of UAV i at time t. Since the UAVs are homogeneous, their radii are all the same and can be represented by r. In addition, we use Pi^des = (xi^des, yi^des), i ∈ [1, N], to represent the destination position of UAV i, and Pj^obs = (xj^obs, yj^obs), j ∈ [1, M], to represent the position of obstacle j. Meanwhile, we employ rj, j ∈ [1, M], to denote the radius of obstacle j.

As shown in Figure 2, Di(t) is the distance at time t from the current position of UAV i to its predefined destination, and Darrived is the arrival threshold: UAV i is considered to have reached its destination when Di(t) is less than Darrived. ui(t) denotes the linear velocity of UAV i at time t, ψi(t) denotes its heading angle, that is, the angle between the linear velocity and the x-axis at time t, and αi(t), the target velocity angle, is the angle between the target line vector and the velocity vector of UAV i at time t. Both ψi(t) and αi(t) lie within the range (−π, π], and the angular velocity of UAV i at time t is denoted by wi(t). The motion model of UAV i is defined as follows:
dxi(t)/dt = ui(t) cos ψi(t),  dyi(t)/dt = ui(t) sin ψi(t),  dψi(t)/dt = wi(t)    (1)
Figure 2: Motion model of UAV i.
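To make the kinematics concrete, the following minimal Python sketch advances a UAV state by one discrete time step under the unicycle-style model of Equation (1); the function name and the Euler integration step are illustrative assumptions rather than the authors' implementation.

    import math

    def step_uav(x, y, psi, u, w, dt):
        """Advance one UAV by dt seconds under the motion model of Eq. (1).

        x, y : current position; psi : heading angle in (-pi, pi];
        u, w : commanded linear and angular velocities.
        """
        x_new = x + u * math.cos(psi) * dt
        y_new = y + u * math.sin(psi) * dt
        psi_new = psi + w * dt
        # wrap the heading back into (-pi, pi]
        psi_new = math.atan2(math.sin(psi_new), math.cos(psi_new))
        return x_new, y_new, psi_new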
The collaborative path planning problem in this work is aimed at creating an uninterrupted flight path for each UAV connecting its starting position and destination while avoiding collisions with other obstacles and UAVs. Each UAV should arrive at its specified position at a given time point. Therefore, a time-sensitive series of waypoints can be used to characterize the flight route of a specific UAV. The time-sensitive waypoint sequence for UAV i is represented by φi, which has the following representation:
φi = {(Pi^1, ti^1), (Pi^2, ti^2), ⋯, (Pi^Ki, Ti)}    (2)
where (Pi^k, ti^k) represents the k-th time-sensitive waypoint, indicating that at time point ti^k, UAV i should arrive at position Pi^k. When UAV i reaches its destination at the arrival time Ti, (Pi^Ki, Ti) denotes the last waypoint. This study focuses on multi-UAV collaborative path planning, with the primary goal being the effective search for an ideal time-sensitive waypoint sequence φi for each UAV i. The following three restrictions must be satisfied while creating the flight routes of the UAVs:
  • 1.

    Collision avoidance constraint: Each UAV must avoid collisions with other UAVs and with obstacles. We define the safety distance ds between two UAVs as the sum of their radii, that is, ds = 2r. The safety distance between UAV i and obstacle j is defined as the sum of their radii, ds^j = r + rj. Specifically, at any time, the distance between any two UAVs should be no less than ds, and the distance between UAV i and obstacle j should be no less than ds^j. This constraint can be expressed as follows:

‖Pi(t) − Pk(t)‖ ≥ ds for all i ≠ k, and ‖Pi(t) − Pj^obs‖ ≥ ds^j for all i, j and all t    (3)
  • 2.

    Motion state update constraint: The motion state of each UAV is updated periodically by applying a new control vector at each iterative step. Over a short duration Δt, the motion of a UAV can be treated as uniform linear motion. Based on the dynamic model in Equation (1), this constraint can be expressed as follows:

xi(t + Δt) = xi(t) + ui(t) cos ψi(t) Δt,  yi(t + Δt) = yi(t) + ui(t) sin ψi(t) Δt,  ψi(t + Δt) = ψi(t) + wi(t) Δt    (4)
  • 3.

    Destination arrival constraint: We consider a UAV to have reached its destination when the distance between the UAV and its endpoint is less than Darrived. Moreover, we use Ti to represent the time at which UAV i reaches its destination. This constraint can be expressed as follows:

Di(Ti) = ‖Pi(Ti) − Pi^des‖ < Darrived    (5)

The collaborative path planning problem can be described as the minimization of the total flight distance of all UAVs, since its primary objective is to find collision-free routes along which each UAV efficiently reaches its respective target. According to the waypoint sequence φi in Equation (2), L(φi) represents the flight length of UAV i from its starting point to its endpoint.

It can be calculated using the following formula:
L(φi) = Σ_{k=1}^{Ki−1} ‖Pi^{k+1} − Pi^k‖    (6)
The collaborative path planning problem addressed in this paper is to find the optimal path for each UAV without collision, aiming to minimize the total path length of all UAVs. It can be expressed as follows:
min Σ_{i=1}^{N} L(φi), subject to constraints (3)-(5)    (7)
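As a small illustration of the objective in Equations (6) and (7), the following sketch computes the length of one UAV's waypoint sequence and the total length over all UAVs; the function names and the NumPy representation of waypoints are assumptions for illustration only.

    import numpy as np

    def path_length(waypoints):
        """Length of one UAV's path, i.e. L(phi_i) in Eq. (6).

        waypoints: array-like of shape (K, 2) with the time-ordered positions.
        """
        diffs = np.diff(np.asarray(waypoints, dtype=float), axis=0)
        return float(np.linalg.norm(diffs, axis=1).sum())

    def total_path_length(all_waypoints):
        """Objective of Eq. (7): the sum of the path lengths of all UAVs."""
        return sum(path_length(w) for w in all_waypoints)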

3.2. Building Multiagent System Based on MADDPG

Because the path planning problem of multiple UAVs is a typical sequential decision problem, it can be modeled as a stochastic game (also known as a Markov game). According to the Markov decision process, the state information of the UAV cluster is represented by the five-tuple <S, A, P, R, γ>.

In the five-tuple <S, A, P, R, γ>, S denotes the set of environment states. A = {a1, a2, ⋯, aN} represents the joint action of the UAVs. P : S × A × S⟶[0, 1] is the state transition function, which gives the probability of the environment moving from the current state to the next state when a joint action is taken. R = {R1, R2, ⋯, RN} denotes the reward functions that describe the effect of the joint action taken by the UAVs in the current state. γ ∈ (0, 1] is a discount factor that adjusts the weight between short-term and long-term returns.

Because UAVs fly fast, have short response times, and have limited observation ranges, each UAV can only observe a portion of the environment. The path planning problem can therefore be described as a partially observable Markov game, characterized by the tuple <S, O, A, P, R, γ>, where O = {o1, ⋯, oN} is the set of local observations of the UAVs.
  • 1.

    Observation space O: Due to the limited observation range of UAVs, they can only detect nearby obstacles and other UAVs. Therefore, the local observation oi of UAV i consists of three parts, namely, its own position, the positions of surrounding UAVs, and the positions of neighbouring obstacles, which is described as

oi = {Pi(t)} ∪ {Pk(t) | k ∈ Ni} ∪ {Pj^obs | j ∈ Mi}    (8)
where Ni and Mi represent the sets of UAVs and obstacles observed by UAV i, respectively.
  • 2.

    Action space A: Since the UAVs in this work are homogeneous, they share the same action space. We use a = {a1, ⋯, aN} to represent the set of actions of the UAVs. According to Equation (1), the pose change of UAV i is controlled by its linear velocity ui and angular velocity wi. As a result, the action space of UAV i can be described as

ai = [ui, wi]    (9)
  • 3.

    Reward function R: In reinforcement learning, the reward function is a crucial component that quantifies the effectiveness of an agent’s behavior. A well-designed reward function often incorporates valuable human experience, enabling agents to acquire more meaningful knowledge during the learning process. Reward functions are usually designed to make it as easy as possible for an agent to achieve its goals. In this paper, the agent’s objective is to reach its specified destination without collision.

Therefore, there are three termination conditions for UAVs: reaching the destination point, colliding with other UAVs or obstacles, and reaching the maximum number of steps without reaching the destination point.

In this work, the objective of each UAV is to reach its designated target without experiencing any collision. Thus, when UAV i reaches its destination, an arrival reward ri^g should be given:
ri^g = Rg if Di(t) < Darrived, and ri^g = 0 otherwise    (10)
where Rg is a predefined positive number.
When UAV i collides with other UAVs or obstacles, it should be penalized. We assume that at the current time step, UAV i has collided with other UAVs or obstacles n times. The collision penalty ri^c it receives can be calculated as
ri^c = n · Rc    (11)
where Rc is a predefined negative number and n is the number of collisions that have occurred at the current moment.
When the maximum number of steps is reached and UAV i has still not reached its destination, a penalty ri^s should be given. Then, ri^s can be calculated via
ri^s = Rs if t = T and Di(t) ≥ Darrived, and ri^s = 0 otherwise    (12)
where Rs is a predefined negative number.
During its flight, a UAV should be penalized when it moves away from its destination. We use dis to represent the distance between UAV i and its target point at the current moment. Then, the distance penalty ri^a can be represented as
ri^a = −dis    (13)
Based on the above description, the total reward for UAV i can be calculated using the following formula (a short sketch of this reward computation follows this list):
Ri = wg ri^g + wc ri^c + ws ri^s + wa ri^a    (14)
where wg, wc, ws, and wa are the weighting factors of the four reward terms, respectively.
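The sketch below assembles the four reward terms of Equations (10)-(14) into a single Python function; the exact form of the distance term and the default weights are assumptions made for illustration, not the authors' tuned values.

    def uav_reward(dist_to_goal, n_collisions, step, max_steps,
                   d_arrived, R_g, R_c, R_s,
                   w_g=1.0, w_c=1.0, w_s=1.0, w_a=1.0):
        """Illustrative total reward for one UAV following Eqs. (10)-(14).

        R_g is a positive arrival bonus; R_c and R_s are negative penalties.
        """
        r_goal = R_g if dist_to_goal < d_arrived else 0.0           # Eq. (10)
        r_collision = n_collisions * R_c                            # Eq. (11)
        r_step = R_s if (step >= max_steps and
                         dist_to_goal >= d_arrived) else 0.0        # Eq. (12)
        r_away = -dist_to_goal   # penalize being far from the goal, Eq. (13)
        return w_g * r_goal + w_c * r_collision + w_s * r_step + w_a * r_away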

4. The PERDE-MADDPG Approach

In this section, we propose the PERDE-MADDPG algorithm to solve the path planning problem of multiple UAVs. PERDE-MADDPG builds upon the MADDPG algorithm and introduces a PER mechanism based on sample priority, which replaces the traditional experience replay mechanism. The algorithm also incorporates delayed update skills to reduce the update frequency of the actor network and target network parameters, which helps mitigate the impact of parameter instability.

In this approach, each UAV is considered an autonomous agent that learns a reasonable path selection strategy and executes optimal actions based on its current local observations. Similar to the MADDPG algorithm, the proposed approach employs centralized training with decentralized execution. During the centralized training phase, agents can share their experiences and iteratively optimize their policies through interactions with the environment and other agents. During the decentralized execution phase, each agent independently takes actions according to its own observations and policy.

The PERDE-MADDPG algorithmic architecture is shown in Figure 3, which can be divided into three main parts: environment, UAVs, and replay buffer D. Below, a detailed explanation will be provided for each section.

Figure 3: PERDE-MADDPG algorithm architecture diagram.

First, the environment module serves as a simulation environment for the path planning problem. It encompasses all the elements that the UAVs need to perceive and interact with. The environment provides state information, such as the current position and velocity, to the UAVs. Through interactions with the environment, the UAVs execute actions and receive reward signals, enabling them to learn adaptive path planning strategies. The environment module thus facilitates the interaction between the UAVs and their surroundings and provides the necessary information and feedback.

Next, the UAV module is the core of the system, where each UAV is represented by an intelligent agent responsible for generating its own optimal path. Each agent uses deep reinforcement learning to select actions based on the environment and the experiences stored in the replay buffer, and it consists of four networks: actor, critic, target actor, and target critic. The actor network chooses actions based on the environment state, the critic network scores the state-action pairs for path planning, and the target networks stabilize the training process. During network updates, we employ a delayed update strategy on top of soft updates: the critic network is updated multiple times while the actor and target networks are updated once, which reduces the update frequency of the actor and target networks and makes training more stable. During the training process, the agents update their parameters based on reward signals and environmental feedback, continuously improving their path planning strategies.

Finally, the replay buffer module is used to store the UAVs’ experiences in the environment and provide samples for the UAV module during training. The replay buffer adopts the PER method, assigning priorities to experiences and sampling and updating them based on these priorities. As a result, experiences that have a greater impact on path planning are selected more frequently for training, thereby improving the efficiency and performance of the path planning algorithm. Through PER, UAVs can better leverage their previous experiences to optimize their path planning strategies.

Each agent’s training process can be characterized as follows: (1) By interacting with the environment, each agent obtains its local observation oi and uses the actor network to determine its optimal action ai. (2) All agents execute their optimal actions, and the environment transitions to the next state. (3) Experience transitions, defined as tuples {o, a, r, o′}, are stored in the experience replay buffer D. (4) K experiences are sampled from the replay buffer D according to their priorities to update the network parameters. (5) The critic network is updated by minimizing the loss. (6) After multiple updates to the critic network, the actor network is updated once using gradient ascent, while the target actor network and target critic network are updated once using a soft update method.

4.1. PER

Experience replay methods were initially introduced in DQN algorithms. The two key design considerations for experience replay are how to store experiences and how to replay them effectively. The standard DQN algorithm uses uniform sampling, treating all experiences equally without considering their varying importance. In reality, however, different experiences have different levels of importance for the agent’s learning, and uniform sampling alone fails to fully leverage the experience data or maximize the learning effectiveness of the agent.

PER is a good solution to the problem of not utilizing experience efficiently, and its key improvement lies in how experiences are replayed. PER abandons the uniform random sampling of classical experience replay and uses the magnitude of the TD-error to measure the priority of an experience. The TD-error of the temporal difference method measures the importance of a sample: experiences with larger TD-errors are given higher priority, while experiences with smaller TD-errors are given lower priority. The TD-error of experience j is given below:
δj = rj + γ Qi′(oj′, aj′) − Qi(oj, aj)    (15)
where Qi′ denotes the target critic of agent i, and oj′ and aj′ are the next observations and the corresponding target-policy actions of experience j.
In order to break the random sampling criterion, the probability of drawing experience j is defined as
P(j) = Dj^α / Σk Dk^α    (16)
where α is a number between zero and one that controls the balance between uniform random sampling and greedy sampling, and Dj is the priority of experience j, defined as
Dj = |δj| + ϵ    (17)
where ϵ is a small positive number that prevents the priority from being zero when δj is zero.
Since prioritized experience replay introduces the TD-error into sampling, it alters the sample distribution in an uncontrolled way, which may make the neural network’s training more prone to oscillation or even divergence. To address this problem, the importance sampling weight ωj is designed as follows:
ωj = (1 / (S · P(j)))^β    (18)
where S is the current number of experiences in the replay buffer and β is the importance sampling parameter, which determines how strongly the bias introduced by prioritized replay is corrected. If β = 1.0, the importance weights fully compensate for the non-uniform sampling probabilities.
Normalization is generally required in practice, so each sample weight is divided by the largest weight, yielding
ωj = ωj / maxi ωi    (19)
where i is the index of the sample with the highest weight.
After incorporating the sample priorities, the new loss function of agent i’s critic network is as follows:
L(θi^Q) = (1/K) Σ_{j=1}^{K} ωj (δj)^2    (20)
where i is the serial number of the agent, j is the serial number of the experience, and K is the size of the minibatch.
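For concreteness, the following is a minimal proportional PER buffer in Python that follows Equations (16)-(19); it is an illustrative sketch (the class and method names are assumptions), not the authors' implementation, and it uses a simple O(N) sampling step rather than the sum-tree usually used for efficiency.

    import numpy as np

    class PrioritizedReplayBuffer:
        """Minimal proportional prioritized replay sketch (Eqs. (16)-(19))."""

        def __init__(self, capacity, alpha=0.6, eps=1e-6):
            self.capacity = capacity
            self.alpha = alpha       # priority exponent, Eq. (16)
            self.eps = eps           # keeps priorities strictly positive, Eq. (17)
            self.data = []
            self.priorities = []
            self.pos = 0

        def add(self, transition):
            # new experiences receive the current maximum priority so that
            # each of them is sampled at least once
            p_max = max(self.priorities, default=1.0)
            if len(self.data) < self.capacity:
                self.data.append(transition)
                self.priorities.append(p_max)
            else:
                self.data[self.pos] = transition
                self.priorities[self.pos] = p_max
            self.pos = (self.pos + 1) % self.capacity

        def sample(self, k, beta=0.4):
            prios = np.asarray(self.priorities) ** self.alpha
            probs = prios / prios.sum()                        # Eq. (16)
            idx = np.random.choice(len(self.data), k, p=probs)
            s = len(self.data)
            weights = (1.0 / (s * probs[idx])) ** beta         # Eq. (18)
            weights = weights / weights.max()                  # Eq. (19)
            batch = [self.data[i] for i in idx]
            return batch, idx, weights

        def update_priorities(self, idx, td_errors):
            for i, delta in zip(idx, td_errors):
                self.priorities[i] = abs(float(delta)) + self.eps   # Eq. (17)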

4.2. Delayed Update

In the MADDPG method, each agent has an actor network and a critic network. Using the local observation oi, the actor network generates the action ai in accordance with the policy πi. The action ai chosen by the actor network is assessed by the critic network using an action-value function Qi(·).

In addition, the target network is an important technique in MADDPG for improving data utilization and algorithm stability [29]. The target network refers to the target actor network and target critic network, which have the same structure as their respective actor and critic networks. In reinforcement learning, the computation of target values can be affected by the fluctuations in the current estimated Q-values, leading to training instability. Introducing target networks helps mitigate the variability of target values, thus improving stability.

In this work, we use θi^π, θi^Q, θi^π′, and θi^Q′ to represent the parameters of the actor network, the critic network, and the corresponding target actor and target critic networks of agent i, respectively.

To update the parameters of the actor network, we utilize the gradient ascent method to maximize the cumulative reward:
∇_{θi^π} J ≈ (1/K) Σ_{j=1}^{K} ∇_{θi^π} πi(oi^j) ∇_{ai} Qi(o^j, a1^j, ⋯, ai, ⋯, aN^j) |_{ai = πi(oi^j)}    (21)
where o = {o1, ⋯, oN}, K is the batch size, and Qi(o, a1, ⋯, aN) is the centralized action-value function of agent i.

In traditional MADDPG methods, the parameters of the actor network, the critic network, and their corresponding target networks are all updated at the same frequency during training. However, such frequent updating can cause oscillations and instability in the agent’s policy learning, causing the agent to lose direction in its policy learning and resulting in unstable agents and occasional policy oscillations in practical applications. To address this issue, we employ delayed update skills: after a certain number of updates to the critic network, we update the actor network and the target networks once, reducing the update frequency of the actor network and its corresponding target networks and stabilizing the learning process.

Specifically, the critic network is updated by minimizing the loss function in Equation (20); after a certain number of critic updates, the actor network is updated once using Equation (21), and then, utilizing the updated parameters θi^π and θi^Q, the target network parameters are softly updated as follows:
θi^Q′ ← τ θi^Q + (1 − τ) θi^Q′,  θi^π′ ← τ θi^π + (1 − τ) θi^π′    (22)
where τ is the inertia factor and τ ≪ 1.
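A compact PyTorch-style sketch of the delayed update combined with the soft update of Equation (22) is shown below; the agent attributes (td_errors, critic_value, the optimizers, and the four networks) are assumed helper interfaces introduced only for illustration, not the authors' code.

    import torch

    def soft_update(net, target_net, tau):
        """Soft update of Eq. (22): theta' <- tau * theta + (1 - tau) * theta'."""
        with torch.no_grad():
            for p, p_targ in zip(net.parameters(), target_net.parameters()):
                p_targ.mul_(1.0 - tau)
                p_targ.add_(tau * p)

    def update_agent(agent, batch, weights, update_step, delay=2, tau=0.01):
        """One PERDE-MADDPG-style update for a single agent (sketch only)."""
        # per-sample TD-errors delta_j of Eq. (15); the target term is assumed
        # to be detached inside the helper
        td_errors = agent.td_errors(batch)
        w = torch.as_tensor(weights, dtype=td_errors.dtype)
        critic_loss = (w * td_errors.pow(2)).mean()          # weighted loss, Eq. (20)
        agent.critic_optim.zero_grad()
        critic_loss.backward()
        agent.critic_optim.step()

        # delayed update: the actor and both target networks are refreshed
        # only once every `delay` critic updates
        if update_step % delay == 0:
            actor_loss = -agent.critic_value(batch).mean()   # ascent on Q, Eq. (21)
            agent.actor_optim.zero_grad()
            actor_loss.backward()
            agent.actor_optim.step()                         # only the actor is stepped here
            soft_update(agent.actor, agent.target_actor, tau)
            soft_update(agent.critic, agent.target_critic, tau)

        # absolute TD-errors become the new priorities of Eq. (17)
        return td_errors.detach().abs().cpu().numpy()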
Algorithm 1: PERDE-MADDPG algorithm.
Input: the total number of episodes Emax, the maximum number of steps per episode T, the batch size K, the replay buffer D, the priority parameter α, the importance sampling parameter β, the update interval φ, the delayed update interval ω
1:  For i = 1 to N do
2:      Initialize the network parameters of the actor, critic, and corresponding target networks of UAV i;
3:  End for
4:  For episode = 1 to Emax do
5:      Initialize a random noise process for exploration;
6:      Get initial state o = (o1, ⋯, oN);
7:      For t = 1 to T do
8:          For i = 1 to N do
9:              Choose action ai according to the current policy and exploration noise;
10:         End for
11:         Execute joint actions a = (a1, ⋯, aN);
12:         Get reward r and new state o′;
13:         Store {o, a, r, o′} in replay buffer D and set Pt = max_{i<t} Pi;
14:         If t mod φ = 0 then
15:             For i = 1 to N do
16:                 For j = 1 to K do
17:                     Sample experience j with probability Pj from D according to Eq. (16);
18:                     Calculate the corresponding δj according to Eq. (15) and ωj according to Eq. (19);
19:                     Update the priority Pj according to Eq. (17);
20:                 End for
21:                 Update critic network parameters by minimizing the loss function of Formula (20);
22:                 If t mod ω = 0 then
23:                     Update actor network parameters using Formula (21);
24:                     Update target network parameters using Formula (22);
25:                 End if
26:             End for
27:         End if
28:     End for
29: End for

According to the previous discussion, the proposed PERDE-MADDPG method can be summarized as Algorithm 1, whose process is as follows. First, we initialize the network parameters of the actor, critic, and corresponding target networks of each UAV i (Lines 1–3). The training process then begins at Line 4 and is divided into Emax episodes with T steps per episode. At the start of each episode, we initialize the noise process and obtain the initial environment state o (Lines 5–6). In each step, each UAV selects its action ai based on its own observation oi, using the policy network and incorporating a certain amount of exploration noise (Lines 7–10). After the joint actions are executed (Line 11), we obtain the new rewards r and observations o′ generated by each UAV (Line 12) and store the experience transition {o, a, r, o′} in the replay buffer D, setting its priority to the current maximum so that the new experience is sampled at least once (Line 13). When the update interval is reached, for each UAV we draw K experiences from the replay buffer D according to their sampling probabilities; for each experience j, we calculate the corresponding δj and ωj according to Equations (15) and (19) and update its priority Pj according to Equation (17) (Lines 14–20). The critic network is then updated using Formula (20) (Line 21). Once the delayed update interval is reached, the network parameters of the actor network, target critic network, and target actor network are updated (Lines 22–25). Finally, once the total number of episodes Emax is reached, the training process stops.
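To show how the pieces fit together, the skeleton below mirrors the outer loop of Algorithm 1 in Python, reusing the PrioritizedReplayBuffer and update_agent sketches given earlier; the env interface (an MPE-style reset/step pair), the agents' act method, and all names are illustrative assumptions rather than the authors' code.

    def train(env, agents, buffer, n_episodes, t_max, batch_size,
              update_every=50, delay=2, beta=0.4):
        """Compact sketch of the PERDE-MADDPG training loop of Algorithm 1."""
        step = 0
        for episode in range(n_episodes):
            obs = env.reset()                      # one observation per UAV
            for t in range(t_max):
                actions = [ag.act(o, explore=True) for ag, o in zip(agents, obs)]
                next_obs, rewards, dones, _ = env.step(actions)
                buffer.add((obs, actions, rewards, next_obs, dones))
                obs = next_obs
                step += 1
                if step % update_every == 0 and len(buffer.data) >= batch_size:
                    for agent in agents:
                        batch, idx, weights = buffer.sample(batch_size, beta)
                        # update_agent returns |delta_j|, used as new priorities
                        new_prios = update_agent(agent, batch, weights,
                                                 step // update_every, delay)
                        buffer.update_priorities(idx, new_prios)
                if all(dones):
                    break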

5. Experiments and Results

In this section, we conduct experiments in the MPE (multiagent particle environment) simulation environment to validate the effectiveness of the proposed PERDE-MADDPG algorithm. We evaluate our method’s performance by contrasting it with three widely used reinforcement learning algorithms: MATD3 [30], MADDPG [31], and SAC [32]. In the following sections, we first explore the settings of parameters in our proposed algorithm and then test the performance gains of our method based on average reward, arrival rate, and path length.

5.1. Parameter Setting

The parameters of the four approaches follow the configurations in [33, 34], as shown in Table 1. The effectiveness of the reinforcement learning algorithms is strongly influenced by the number of neurons in the hidden layers of the neural networks as well as by the learning rate of each network. Thus, in order to determine suitable values for the later experiments, we first carry out tests to assess the effects of the number of neurons n and the learning rate ρ.

Table 1. Parameter configuration in the algorithm.
Parameters Value
Batch size K 1024
Replay buffer D 1e6
Update interval φ 50
Soft update rate τ 1e − 2
Priority parameter α 0.6
Reward discount factor γ 0.95
Delayed update interval ω 100
Number of steps in each episode T 100
The total number of episodes Emax 50,000
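For reference, the hyperparameters of Table 1 can be collected into a single configuration object; the sketch below is only an illustrative grouping, and the key names are assumptions.

    # Hyperparameters from Table 1 gathered into one illustrative config dict.
    CONFIG = {
        "batch_size": 1024,        # K
        "buffer_size": int(1e6),   # replay buffer D
        "update_interval": 50,     # phi
        "tau": 1e-2,               # soft update rate
        "alpha": 0.6,              # priority parameter
        "gamma": 0.95,             # reward discount factor
        "delay": 100,              # delayed update interval omega
        "steps_per_episode": 100,  # T
        "episodes": 50_000,        # E_max
    }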

In the experiments shown in Figures 4 and 5, there are 4 UAVs and 12 obstacles, and their positions are randomly generated. Each UAV flies towards its destination while avoiding collisions with obstacles and other UAVs. Figures 4 and 5 display the mean episode reward (every 1000 episodes) in the training process of the PERDE-MADDPG method.

Figure 4: Mean reward obtained per 1000 training episodes when the number of neurons changes.
Figure 5: Mean reward obtained per 1000 training episodes with different learning rates of the neural networks.

As shown in Figure 4, the number of neurons in the hidden layers changes from 32 to 256. From Figure 4, it can be observed that our method achieves the shortest convergence time and highest reward when the number of neurons is set to 64. As a result, setting the neuron number n to 64 is advised. Similarly, Figure 5 shows the average reward when the learning rate ρ increases. It can be seen that when the learning rate is set to 0.01, the algorithm converges faster and achieves a higher reward. Therefore, we set the learning rate ρ to 0.01.

5.2. Mean Reward Evaluation

Under the parameter settings mentioned earlier, we evaluate the performance of our method based on the mean reward, that is, the mean value of the rewards obtained per episode during the evaluation process. The mean reward is a significant indicator of a reinforcement learning algorithm’s learning efficiency and optimization capability, and algorithms that achieve a higher mean reward are considered more efficient and accurate. In this subsection, we sum the rewards of the agents over 100 evaluation episodes, obtain the mean reward by dividing the sum by 100, and compare the methods by their mean rewards.

Figure 6 displays the average rewards obtained by the four reinforcement learning methods under different scenarios, where the number of obstacles is fixed at 12, and the number of UAVs ranges from 2 to 6. From the graph, it can be observed that as the number of UAVs increases, the average rewards for each method also increase. This is because, with an increasing number of UAVs, although the number of times each UAV reaches its destination decreases, the overall number of successful UAV arrivals increases, resulting in higher rewards. Furthermore, we can also observe that regardless of the number of UAVs, our proposed PERDE-MADDPG algorithm achieves higher mean reward than the other three algorithms. For instance, when the number of UAVs is 5, our algorithm obtains a reward of 52.70, which is an improvement of 10.43% compared to MATD3, 17.27% compared to MADDPG, and 26.74% compared to SAC.

Figure 6: Mean reward obtained per evaluation episode by the four methods when the number of UAVs increases.

Figure 7 illustrates the average rewards obtained by the four reinforcement learning methods under different scenarios, where the number of UAVs is fixed at 2 and the number of obstacles ranges from 12 to 20. From the graph, it can be observed that as the number of obstacles increases, the average rewards of each method decrease. This is because, with more obstacles, the likelihood of collisions increases, making it more challenging for the UAVs to reach their destinations and resulting in lower rewards. Additionally, we can observe that regardless of the number of obstacles, our proposed PERDE-MADDPG algorithm achieves a higher mean reward than the other three algorithms. For instance, when the number of obstacles is 20, our algorithm obtains a reward of 18.75, which is an improvement of 7.10% compared to MATD3, 15.96% compared to MADDPG, and 25.18% compared to SAC.

Figure 7: Mean reward obtained per evaluation episode by each method with different numbers of obstacles.

5.3. Arrival Rate Evaluation

The arrival rate is an important metric for evaluating path planning algorithms. We evaluate the trained algorithm in different scenarios for 100 episodes. Each time all UAVs reach their destinations, we increment the counter for successful episodes. Therefore, the arrival rate of the algorithm in a particular scenario is defined as the number of successful episodes divided by 100.

Figure 8 shows the arrival rate of the four methods in various scenarios where the number of obstacles is fixed at 12 and the number of UAVs ranges from 2 to 6. The graph shows that the success rates of all four algorithms decline in a comparable pattern as the number of UAVs increases. This is because, with more UAVs, the likelihood of collisions between them increases, resulting in a decrease in the number of successful episodes. Additionally, the graph indicates that our algorithm outperforms the other three methods for any fixed number of UAVs. For instance, when the number of UAVs is five, the success rate of the PERDE-MADDPG algorithm is 40.00%, which is 10.00%, 13.00%, and 16.00% higher than that of MATD3, MADDPG, and SAC, respectively.

Figure 8: Arrival rate achieved by different methods when the number of UAVs increases.

Figure 9 illustrates the arrival rate of four different methods in various scenarios where the number of UAVs is 2 and the number of obstacles ranges from 12 to 20. From the graph, it is clear that the arrival rate of all four algorithms falls as the number of obstacles rises. This is because with more obstacles, the scenarios become more complex, and the likelihood of collisions between UAVs and obstacles increases, resulting in a decrease in the success rate. Additionally, the graph also demonstrates that regardless of the number of obstacles, our algorithm outperforms the other three methods. For instance, when the number of obstacles is 20, the success rate of the PERDE-MADDPG algorithm is 70.00%, which is 9.00%, 14.00%, and 17.00% higher than that of MATD3, MADDPG, and SAC, respectively.

Figure 9: Arrival rate of the four methods with different numbers of obstacles.

5.4. Path Length Evaluation

In this section, we validate the performance of our method by evaluating the total length of the planned paths. As stated in Equation (7), the objective of multi-UAV path planning is to minimize the total length of the planned paths, so a direct evaluation of path planning methods is to compare the total lengths of the paths they produce: a better method generates shorter paths for the UAVs.

We evaluate the trained algorithms in different scenarios for 100 episodes. If all UAVs reach their destinations in an evaluation episode, we record their total path length. The path length metric in a given scenario is then defined as the total path length of all UAVs in the successful episodes divided by the number of successful episodes.
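The two evaluation metrics used in Sections 5.3 and 5.4 (arrival rate and mean total path length over successful episodes) can be computed as in the small sketch below; the list-of-pairs input format is an assumption for illustration.

    def evaluate_metrics(results):
        """Arrival rate and mean total path length over evaluation episodes.

        results: list of (all_arrived, total_length) pairs, one per episode.
        """
        successes = [length for arrived, length in results if arrived]
        arrival_rate = len(successes) / len(results)
        mean_length = (sum(successes) / len(successes)) if successes else float("nan")
        return arrival_rate, mean_length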

Figure 10 shows the path lengths of the four methods in the experiment with 12 obstacles and a number of UAVs varying from 2 to 6. It can be seen that as the number of UAVs increases, the path lengths of all methods also increase. This is because the total flying distance of the UAVs grows with their number and, in addition, the probability of collisions between UAVs increases, so UAVs tend to choose longer paths to avoid collisions. From Figure 10, it can be observed that regardless of the number of UAVs used in the experiment, our PERDE-MADDPG method consistently plans shorter paths than the other methods. When the number of UAVs is five, our method achieves a path length of 89.78 km, which is 7.64%, 10.13%, and 13.76% shorter than the path lengths produced by the MATD3, MADDPG, and SAC algorithms, respectively.

Figure 10: Path lengths planned by different methods when the number of UAVs increases.

Figure 11 displays the path lengths of four methods in the experiment with a fixed number of 2 UAVs and a varying number of obstacles from 12 to 20. It is evident that as the number of obstacles increases, the path lengths of all methods also increase. This is because as the number of obstacles increases, the likelihood of collisions between UAVs and obstacles also increases. UAVs need to navigate through more complex paths to avoid collisions with obstacles. From Figure 11, it can be observed that our PERDE-MADDPG method consistently plans paths with shorter lengths compared to the other methods, regardless of the number of obstacles. When the number of obstacles is 16, our method achieves a path length of 35.2 km, which is 5.98%, 10.19%, and 14.62% shorter than the path lengths produced by the MATD3, MADDPG, and SAC algorithms, respectively.

Figure 11: Path lengths achieved by the four methods with different numbers of obstacles.

6. Conclusions

This paper proposed a path planning algorithm, PERDE-MADDPG, based on PER and delayed update skills. When selecting experiences from the experience pool, we employed a PER mechanism to utilize valuable experiences efficiently. On this basis, we adopted a delayed update strategy to address the issue of unstable updates during the agents’ training process. Experimental results show that the proposed PERDE-MADDPG algorithm outperforms the MATD3, MADDPG, and SAC algorithms in terms of mean reward, arrival rate, and planned path length in different environments. However, when the number of drones becomes too large, the PERDE-MADDPG algorithm suffers from convergence difficulties and poor training performance. We therefore intend to incorporate attention mechanisms in future work to further boost the algorithm’s performance.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62106202), the Key Research and Development Program of Shaanxi (No. 2024GX-YBXM-118), the Aeronautical Science Foundation of China (No. 2023M073053003), and the Fundamental Research Funds for the Central Universities.


Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.
