Volume 2025, Issue 1 4810561
Research Article
Open Access

Graph Learning of Semantic Relations (GLSR) for Cooperative Multiagent Reinforcement Learning

Pengting Duan
School of Software, Northwestern Polytechnical University, Xi’an, 710072, China
North Automatic Control Technology Institute, Taiyuan, 030006, China

Chao Wen
Institute of Big Data Science and Technology, Shanxi University, Taiyuan, 030006, China

Baoping Wang (Corresponding Author)
School of Software, Northwestern Polytechnical University, Xi’an, 710072, China

Zhenni Wang
School of Software, Northwestern Polytechnical University, Xi’an, 710072, China

Zhifang Wei
Institute of Big Data Science and Technology, Shanxi University, Taiyuan, 030006, China
First published: 19 May 2025
Academic Editor: Mohamadreza (Mohammad) Khosravi

Abstract

Prominent achievements of multiagent reinforcement learning (MARL) have been recognized in the last few years, but effective cooperation among agents remains a challenge. Traditional methods neglect the modeling of action semantic relations when learning joint action latent representations. In other words, uncertain semantic relations might hinder the learning of sophisticated cooperative relationships among actions, which may lead to homogeneous behaviors across all agents and limited exploration efficiency. Our aim is to learn the structure of the action semantic space to improve the cooperation-aware representation for policy optimization in MARL. To achieve this, a scheme called graph learning of semantic relations (GLSR) is proposed, where action semantic embeddings and joint action representations are learned in a collaborative way. GLSR incorporates an action semantic encoder for capturing semantic relations in the action semantic space. By leveraging a cross-attention mechanism with action semantic embeddings, GLSR uses action semantic relations to guide the mining of cooperation-aware joint action representations, implicitly facilitating agent cooperation in the joint policy space for more diverse behaviors of cooperative agents. Experimental results on challenging tasks demonstrate that GLSR attains state-of-the-art outcomes and shows robust performance in multiagent cooperative tasks.

1. Introduction

Cooperative multiagent reinforcement learning (MARL) has been widely applied in numerous fields, including autonomous driving [1], intelligent community services [2], traffic control [3], and intelligent UAV systems [4–7]. One example is the interactive warfare simulation system depicted in Figure 1. In the digital twin battlefield, the agents seek to explore diverse policies in large-scale mission training to enhance professional competence in real-world scenarios. To maximize a shared goal, multiple agents interact iteratively in a common space, continually refining their policies by absorbing useful information and acquiring transferable skills from each other [8]. As the number of agents increases, the dimension of the joint action space of multiagent systems (MAS) grows exponentially. From the viewpoint of an individual agent, the behavior selection strategies employed by other agents disrupt the stability of the environment. These defects prevent MARL algorithms from scaling to more agents [9].

Figure 1. Interactive warfare simulation system in the military.

To address the poor convergence of independent policy gradient–based methods [10] in the aforementioned MAS, a plethora of trust region optimization algorithms, such as MAPPO [11] and HATRPO/HAPPO [12], have achieved notable successes by learning a monotonic policy updating scheme. One of the significant aims in MAS is reasonable credit assignment; value function decomposition–based methods [13, 14] decompose the global value function into a combination of local value functions according to the individual global max (IGM) principle [15], which simplifies policy exploration and ensures that both the global and individual values are strictly monotonic. Policy gradient methods like MADDPG [16], COMA [17], and HAPPO [12] have also made efforts to encourage agent discriminability in reward distribution. As a general and feasible approach to credit assignment, multiagent advantage decomposition [12] has promoted individualized policy exploration in cooperative MARL. Multiagent advantage decomposition, as well as value decomposition [18, 19], explicitly models the individual contributions of cooperative agents rather than the topological structure of the cooperation space in MARL. The lack of effective cooperation relation modeling may impede agents from diverse policy exploration in complex scenarios. SOG [20] and RODE [21] have been presented to fill this gap. However, these methods follow a hierarchical learning framework and require manual parameter fine-tuning for joint action space decomposition through action clustering, which may largely limit their applications. Therefore, research on automatic modeling of cooperative action relations is still insufficient in MARL, where uncertain semantic relations may hinder the learning of sophisticated cooperative relationships among actions.

A natural solution to mitigating the uncertainty is to design auxiliary objectives to learn cooperative action representations and enrich the MARL exploration process. Hence, the effective understanding of semantic information related to task goals in complex scenarios is crucial for efficient cooperation. Different from attention-based learning for the importance distribution of each agent’s allies or GNN-based relationship learning [22], we concentrate on the graph modeling of structure information in action semantic space in order to diversify exploration for global optimal solutions in large-scale application scenarios with heterogeneous agents. The action semantic space can be established by the semantic embeddings of the heterogeneous actions and the structure of the space can be modeled by the correlation among the embeddings. The semantic embeddings capture the potential cooperation between heterogeneous actions, which is crucial for cooperation-aware modeling of multiagent interaction. This is distinguished from the specific priority setting [23] and interactive communication mechanism [24]. In addition, less attention has been paid to the collaboration between semantic encoding and cooperation-aware action representation for policy optimization.

In this paper, we propose a method named GLSR, i.e., graph learning of semantic relations, for cooperative MARL. The rich semantic dependencies among heterogeneous actions are encoded into action semantic embeddings by capturing the correlations. In a collaborative manner, the action semantics is encoded through graph modeling, while the cooperation-aware action representation is learned for policy optimization. The cooperative information among agent actions can be mined with correlation-aware action semantics, which is also enhanced by the backpropagation of joint action reward information. Each agent employs trust-region policy optimization theory and a cross-attention mechanism [25] to realize the behavior prediction, which can prevent the abrupt degradation in the joint action space. GLSR makes a first endeavor to develop a collaborative learning framework for both the semantic relation graph network and the multiagent actor-critic network.

The contribution of our paper is summarized as follows:
  • We propose graph learning of semantic relationships for cooperative MARL and introduce an innovative collaborative framework of action semantic relation encoding and latent representation learning for policy optimization.

  • We design a reasonable action semantic encoder from the perspective of graph learning. The correlation-aware action semantics provides guidance to the learning process of cooperation-aware latent representations through a semantic-guided actor network.

  • GLSR outperforms state-of-the-art methods on the challenging StarCraft II Multiagent Challenge (SMAC) benchmark, especially in the superhard map scenarios that require complex cooperation.

2. Related Work

2.1. MARL

MARL extends single-agent RL and has attracted great attention for functioning well in more dynamic and complex tasks within the framework of Markov games [26]. The policy of each agent can be independently optimized by directly adopting decentralized learning [27], but this may suffer from instability and poor convergence of the joint policy [28]. MARL algorithms therefore tend to optimize the agents’ action space by learning a joint policy, which mitigates the instability in MAS [29]. To effectively overcome the problems of lazy agents and spurious rewards under the shared reward mechanism [30], VDN [13] adopts a simple summation to generate the global value function without additional state information. QMIX [14] generates the weights of the mixing network through independent hypernetworks and limits the weights to positive values to ensure the monotonicity constraint. By leveraging a more reliable decomposition technique induced by IGM, QTRAN [15] optimizes individual agent actions and relaxes the monotonicity restriction. However, the monotonicity constraint of these algorithms is a sufficient but not necessary condition for joint policy optimization, which may limit them to solving the “credit assignment” problem under completely equivalent cooperative relationships in cooperative tasks with monotonic benefits. To remove this limitation, the multiagent advantage decomposition assumption [12], which holds in general for Markov games, provides a powerful credit assignment approach in MARL. Building on such reward criteria, it is worthwhile to focus on modeling cooperative actions and their contribution to the performance of interacting agents in complex scenarios.

2.2. Semantic Modeling for Cooperative MARL

A large number of MARL methods have been devised for cooperative scenarios, where cooperation modeling under a shared team reward plays an important role in improving convergence. The prevalent Centralized Training Decentralized Execution (CTDE) algorithms with Actor Critic (AC) network architectures [16, 17] have been developed by combining policy gradients with value function decomposition, such as MADDPG [16], MAPPO [11], and HAPPO [12]. MADDPG stores latent collective information such as agent state transitions and rewards in a replay buffer during training to facilitate experience sharing among all agents. MAPPO and HAPPO handle agent interactions by exploiting the others’ observations and actions, and are also committed to modeling task-level semantics for cooperative exploration with strategies such as advantage function decomposition and shared network parameters [31, 32]. From the perspective of implicit semantic communication between agents, interactions among agents with independent action representations have been investigated for cooperative modeling [33–36]. Furthermore, the effects of conflicting policy gradients among agents can be mitigated according to trust region policy optimization theory, which induces a robust optimization framework with a monotonic improvement guarantee in various scenarios [37, 38]. Semantic features have proved effective for learning inherent structure and generalized relationships in link prediction [39], which facilitates global task understanding. The semantic consistency inherent in state embeddings has been exploited to efficiently capture memories similar to the current state, which improves policy exploration [40]. However, the latent representation is guided by immutable prior knowledge of relations, and the latent representation learning for policy optimization lacks collaboration with dynamic semantic encoding, which inevitably results in overfitting to specific teammates [41] and frequently suffers from suboptimal convergence traps in complex cooperative scenarios.

2.3. Collaborative Learning

It is significant for MARL to endow agents with semantic knowledge to infer the actions of other agents, particularly in cooperative tasks. Effective relation modeling enables agents to cooperate efficiently in policy exploration. In addition to revealing the hidden relationship between actions and other agents’ policies, high-level modeling guided by semantic knowledge can be utilized to learn latent representations of agent policies [42, 43]. However, the computational load increases significantly with the number of agents, which seriously limits applications in large-scale heterogeneous MARL. To overcome this, a hierarchical collaborative learning framework has been investigated by decomposing joint action spaces into specific action spaces or role-related action spaces through clustering [21]. Different from partitioning the action space of each agent based on manually constructed roles and subtasks, graph modeling has been adopted to efficiently construct the edge set of all agents from the joint actions of each agent pair [44]. In order to diversify policy exploration, evolutionary algorithms have been incorporated into MARL through the crossover of policy representations and random parameter perturbations [45–47]. However, these methods primarily focus on learning action latent representations with respect to roles and environmental observations, lacking effective modeling of the structure information in the action semantic space. It has been demonstrated that the dependency between latent representations and semantics can be well captured by constructing a latent embedding space [48, 49]. Semantic correlations can be explicitly encoded into interdependent relation-aware embeddings by graph neural networks [45]. Owing to the great potential of relation representation learning, graph neural networks are promising for modeling action semantic relations in the MARL scenario. We make a first attempt at semantic relation encoding that collaboratively learns latent action representations for policy optimization.

3. GLSR for MARL

The overall illustration of the network architecture of the proposed GLSR is given in Figure 2. In the action semantic encoding module, the action semantic embeddings are learned in action co-occurrence semantic space through a graph autoencoder with an action relation graph. In the semantic-guided actor network, each agent’s optimal action is decoded from the cooperation-aware joint action representation, the learning of which is guided by the action semantic embeddings. The agents can only access their preceding observations encoded by the critic network before taking actions, which can be ensured by the masked cross-attention blocks. The semantic embeddings and the cooperation-aware action representations are learned in a collaborative way rather than in a two-stage procedure. In the collaborative framework, cooperative relation information can be incorporated into correlation-aware action semantic embeddings, which is also reinforced by the backpropagation of joint action reward information. The essential modules of GLSR will be described in detail.

Figure 2. The framework of the proposed GLSR approach. GLSR learns action semantics and cooperation-aware action representations collaboratively. The action semantic encoding module produces action semantic embeddings, which are exploited to guide cooperation-aware action representation learning within the semantic-guided actor network. Furthermore, the reinforcement learning with the cooperation-aware action representation in turn reinforces the action semantics learning with joint action reward information by way of the policy gradient.

3.1. Problem Formulation

Fully cooperative MAS are generally modeled as Markov games [46] based on shared rewards, consisting of a tuple . describes the set of all possible states, which is a generalization used to represent the environment. and denote the joint action and observation space, respectively. is the reward function, is the observation function, and γ ∈ [0, 1) is the discount factor that encourages agents to focus more on immediate rewards. is the set of agents. For arbitrary disjoint, ordered subsets of agents i1:m = {i1, …, im}, . At each time step, each agent in the global state st receives an observation . After that, a joint observation is created from the collection of these observations. Based on the current state, the agents select actions according to the joint policy π to interact with the environment iteratively. The goal of the agents is to find the optimal joint policy π that maximizes the expected accumulated reward, i.e.,
()
where denotes the marginal observation distribution and R(Ot, At) is the reward value returned by the environment after executing action At, which is a shared reward function for all agents.
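As a minimal illustration, the Python sketch below estimates the expected discounted return of the shared team reward by a Monte Carlo average over sampled episodes; the names and data layout are illustrative assumptions rather than part of GLSR.

```python
import numpy as np

def discounted_return(team_rewards, gamma=0.99):
    """Sum_t gamma^t * R(O_t, A_t) for one episode of shared team rewards."""
    ret, g = 0.0, 1.0
    for r in team_rewards:   # every agent receives the same team reward r at step t
        ret += g * r
        g *= gamma
    return ret

# The joint-policy objective is the expectation of this return; here it is
# estimated by a simple Monte Carlo average over sampled episodes.
episodes = [np.array([0.0, 0.1, 1.0]), np.array([0.0, 0.5, 0.5, 1.0])]
objective_estimate = np.mean([discounted_return(r) for r in episodes])
```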

3.2. Action Semantic Encoding

The structure modeling of the action semantic space provides an effective understanding of semantic relation information related to task goals, which is crucial for efficient cooperation. Action co-occurrence provides fundamental semantic relation information for learning intricate cooperative relationships among actions. In this section, we focus on the graph modeling of the action co-occurrence semantic space so as to learn cooperation-aware action latent representations and enrich the MARL exploration process.

Inspired by the efficient sample utilization of parallel training frameworks [11, 50], GLSR incorporates action statistics into a multithread training framework to improve sample utilization. In addition to the interaction with the observations, the action latent representations are embedded with cooperative relationships through interaction with the action semantic embeddings. Specifically, an action relation graph is first constructed based on statistics of action co-occurrence, where the node set V corresponds to the action set and the edge set E stands for the corresponding collection of action pairs. Since similar semantic embeddings should be shared by actions with strong co-occurrence, the adjacency matrix is used to store the edge-related weights, standing for the conditional probability of co-occurrence (cooperation) of action pairs. The symmetric adjacency matrix can be formulated as follows:
()
where Pr(lj|li) represents the conditional probability of action j given action i, and the elements along the diagonal of are zeros. The dimension of is k × k, with k standing for the total number of action types, and it is constructed from the sampled actions in the replay buffer. The dimension of is determined by the action space of the agents and grows with the number of action types rather than the number of agents, which indicates that the proposed action semantic encoding has good scalability. Although the adjacency matrix can be constructed in other alternative ways, such as a learnable form, we focus on its co-occurrence-based form, which is embedded in the following collaborative learning framework for cooperation-aware representation and joint policy updating. We leave the research on other forms to future work.
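For concreteness, the following sketch builds such a symmetric adjacency matrix from per-step binary occurrence vectors of the k action types sampled from the replay buffer; the binary summary format and the averaging-based symmetrization are assumptions, and the exact formulation above may differ.

```python
import numpy as np

def cooccurrence_adjacency(occurrences):
    """Sketch of the symmetric co-occurrence adjacency matrix construction.

    occurrences: (T, k) binary array; entry [t, i] = 1 if action type i is taken
    by any agent at sampled step t (an assumed summary of the replay buffer).
    """
    occ = occurrences.astype(np.float64)
    n_joint = occ.T @ occ                            # N_ij: co-occurrence counts
    n_single = np.clip(occ.sum(axis=0), 1.0, None)   # N_i: per-type occurrence counts
    cond_prob = n_joint / n_single[:, None]          # row i holds Pr(l_j | l_i) = N_ij / N_i
    np.fill_diagonal(cond_prob, 0.0)                 # zero diagonal, as stated above
    return 0.5 * (cond_prob + cond_prob.T)           # assumed symmetrization

# Example with 5 sampled steps and k = 3 action types; the result is k x k,
# so it grows with the number of action types rather than the number of agents.
occ = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 1]])
adjacency = cooccurrence_adjacency(occ)
```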
To integrate the structure information of the co-occurrence semantic space into action semantic embeddings, a graph autoencoder is adopted as the action semantic encoding network with parameters φ. A graph isomorphism network (GIN) [47]-based encoder is employed for graph learning with an action relation graph as input. The GIN-based encoder captures action cooperation from the semantic relations in the co-occurrence semantic space. The node features output by a GIN layer can be obtained as
()
where denotes the matrix of action nodes, fφ stands for a two-layer MLP with each linear layer followed by batch normalization and a LeakyReLU function, and ϵ is the learnable weight of the retained information of the original node. H(0) is initialized according to the standard normal distribution instead of one-hot vectors. After encoding, is regarded as the output action semantic embeddings , for learning action latent representations in the semantic-relation-guided actor-critic network.
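The PyTorch sketch below shows one GIN layer of the kind described above (learnable ϵ and a two-layer MLP with batch normalization and LeakyReLU); the layer width and the stacking of two layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN layer: H_out = f_phi((1 + eps) * H + A @ H), as described above."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))          # learnable weight of the original node
        self.f_phi = nn.Sequential(                      # two-layer MLP with BN and LeakyReLU
            nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.LeakyReLU(),
            nn.Linear(out_dim, out_dim), nn.BatchNorm1d(out_dim), nn.LeakyReLU(),
        )

    def forward(self, h, adj):
        # h: (k, d) node features initialized from N(0, 1); adj: (k, k) weighted adjacency
        return self.f_phi((1.0 + self.eps) * h + adj @ h)

# Assumed usage: two stacked GIN layers produce the action semantic embeddings E.
k, d = 16, 128
h0 = torch.randn(k, d)
adj = torch.rand(k, k)
adj = 0.5 * (adj + adj.T)
adj.fill_diagonal_(0.0)
layer1, layer2 = GINLayer(d, d), GINLayer(d, d)
E = layer2(layer1(h0, adj), adj)                         # (k, d) action semantic embeddings
```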
The GIN encodes actions with cooperative relationships to adjacent locations in the co-occurrence semantic space by deterministically integrating the correlations among actions into an adjacency matrix. Since GIN is unable to increase the independence among noncooperative actions through neighborhood aggregation alone, it is difficult for the noncooperative action embeddings to be pulled apart from each other. Consequently, a pairwise decoder in the graph autoencoder is employed to parse the pairwise relationships between learned action semantic embeddings. The topological constraint on the action semantic space is imposed with the decoder objective adopted as follows:
()
where sim(ei, ej) stands for the cosine similarity calculation of ei and ej, and . GLSR provides cooperation-aware action semantic representation by capturing the structure of action semantic space rather than simply adopting prior knowledge of uncertain action semantic relations.
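A hedged sketch of the pairwise decoder objective in (4) follows; it assumes the decoder regresses the cosine similarity of each embedding pair toward the corresponding co-occurrence weight so that weakly related (noncooperative) actions are pulled apart, which may differ from the exact form used in the paper.

```python
import torch
import torch.nn.functional as F

def semantic_embedding_loss(E, adj):
    """Sketch of the pairwise decoder loss on the action semantic embeddings.

    E:   (k, d) action semantic embeddings output by the GIN encoder.
    adj: (k, k) symmetric co-occurrence adjacency with zero diagonal.
    Assumption: cos(e_i, e_j) is regressed toward adj[i, j] for all pairs i != j.
    """
    e = F.normalize(E, dim=-1)
    sim = e @ e.t()                                   # sim[i, j] = cosine similarity of e_i, e_j
    off_diag = ~torch.eye(E.shape[0], dtype=torch.bool, device=E.device)
    return F.mse_loss(sim[off_diag], adj[off_diag])
```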

3.3. Semantic-Guided Actor Network

In the semantic-guided actor network, the action sequence of preceding agents is initially passed to the decoder, where is an arbitrary symbol indicating the start of decoding. The action correlations are incorporated into the cooperation-aware action representation learning. Meanwhile, the actor network propagates joint action reward information based on the cooperation-aware representation to the action semantic encoder.

Given action of k th category , a learnable mapping from original joint action space to cooperation-aware joint action representation space is adopted. Furthermore, the action semantic embeddings E from the action semantic encoder are employed to construct semantic specific learnable mapping ϑk, which can be obtained as
()
where ek represents the action semantic embedding corresponding to the k th category action lk, obtained from the k th row of E.
To realize this mapping, we first obtain the action embedding by mapping each ai from the original joint action space:
()
where Wa and ba are learnable parameters, ς is the GELU activation, and the embedded action of each agent can be optimized to facilitate the following interaction with cooperation embeddings.
Instead of the self-attention mechanism that only extracts correlations between actions [26], the cross-attention mechanism is then employed to learn the cooperation-aware joint action representation with the guidance of the action semantics. Concretely, we exploit an additional projection of a linear layer to obtain the cooperation embeddings:
()
where We and be are learnable parameters, which make ek the same dimension as . Successively, the cross-attention block followed by a two-layer MLP with residual connections is employed to incorporate the action semantics into the enhanced action latent representation. The cooperation relationships among actions of agents can be enhanced through the interaction between and , which results in
()
where WQ, WK, and WV are learnable parameters, and the attention ξi,k is computed to estimate the importance of each action semantics, which benefits the suppression of weakly correlated action semantic information. Another cross-attention mechanism is incorporated in the last decoding block for the interaction between the enhanced action latent representation and the encoded observations from the critic network, generating the cooperation-aware action representation:
()
where the attention ηi,j is computed by adopting a sequence mask operation on the i th observation to achieve the action sequence updating for the j th agent wherein i < j. To prevent gradient vanishing with increasing depth, the last decoding block employs an MLP (implemented as a two-layer perceptron with GELU activation) combined with skip connections to output the joint action representations . Note that the linear layers scale proportionally with the feature dimension rather than the number of agents. Then, the representations are passed into an MLP, which generates the probability distribution of the im th agent’s action, namely, the policy , which guarantees monotonic performance improvement during training [26]. In the inference stage, the action of the im th agent is generated in a sequential manner, while during the training stage, the outputs of all actions can be computed in parallel simply because the preceding actions have already been collected and stored in the replay buffer. As a result, the joint policy πϑ = (a1, …, an) of the agents can be obtained simultaneously. The parallel training architecture of the semantic-guided decoding block, with the attention mechanism as its kernel, effectively mitigates the limitations inherent in sequential update paradigms and thus has far lower time complexity. The cooperation-aware representations are processed by the actor network to construct the following clipping PPO objective [11] for actor network training:
()
where is the joint advantage function constructed by the advantage decomposition theorem [12]. Using the output of the critic network, generalized advantage estimation (GAE) [48] is adopted to facilitate multiagent credit assignment.
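The sketch below pieces together the semantic-guided decoding described in this section: embedded actions first attend to the projected action semantic embeddings through cross-attention, then to the encoded observations under a causal mask, and an MLP head outputs per-agent action logits. The module layout, dimensions, and single-head attention are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class SemanticGuidedDecoderBlock(nn.Module):
    """Sketch of one semantic-guided decoding block (single head, batch size 1)."""
    def __init__(self, d_model, n_actions):
        super().__init__()
        self.embed_action = nn.Sequential(nn.Linear(n_actions, d_model), nn.GELU())
        self.project_semantics = nn.Linear(d_model, d_model)   # cooperation embeddings
        self.attn_semantics = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.attn_obs = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))
        self.policy_head = nn.Linear(d_model, n_actions)       # per-agent action logits

    def forward(self, prev_actions_onehot, semantic_embeddings, encoded_obs):
        # prev_actions_onehot: (1, n, n_actions) shifted action sequence (start token first)
        # semantic_embeddings: (1, k, d_model) from the GIN encoder
        # encoded_obs:         (1, n, d_model) from the critic's observation encoder
        x = self.embed_action(prev_actions_onehot)
        coop = self.project_semantics(semantic_embeddings)
        # Cross-attention 1: embedded actions attend to the action semantics (no mask).
        h, _ = self.attn_semantics(query=x, key=coop, value=coop)
        x = x + h
        # Cross-attention 2: causal mask so agent j only attends to observations i <= j.
        n = x.shape[1]
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        z, _ = self.attn_obs(query=x, key=encoded_obs, value=encoded_obs, attn_mask=causal)
        z = x + z
        z = z + self.mlp(z)                                     # residual (skip) connection
        return self.policy_head(z)                              # (1, n, n_actions) logits

# Assumed shapes: n agents, k action types, model width d.
n, k, d = 4, 10, 64
block = SemanticGuidedDecoderBlock(d_model=d, n_actions=k)
logits = block(torch.zeros(1, n, k), torch.randn(1, k, d), torch.randn(1, n, d))
```

During inference, such a block would be queried agent by agent in sequence; during training, the logits of all agents are produced in a single parallel pass because the preceding actions are already stored in the replay buffer.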

3.4. Critic Network

The sequence of observations (o1, …, on) is encoded by the critic network with parameters ϕ through several encoding blocks, which output the latent representations. To avoid gradient vanishing and network degradation with increasing depth, each encoding block employs a relation-enhanced mechanism [55, 56] followed by a two-layer MLP with residual connections. The encoded observations are represented as (), which integrate the information inherent in individual agents as well as the higher-level correlations through their interactions. An extra projection is added in the critic network to estimate the value function and provide environmental perceptions for actor network learning. The critic network is trained with an update rule based on the squared Bellman error as follows:
()
where represents the parameters of the target critic network, which are periodically updated to avoid oscillation during training [49].
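A minimal sketch of the critic update follows: the value estimate is regressed toward a bootstrapped target produced by a periodically synchronized target critic. The one-step TD target and the hard parameter copy are simplifying assumptions; in practice, GAE-based return targets can be used as described above.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, obs, rewards, next_obs, dones, gamma=0.99):
    """Squared Bellman error for the centralized critic (one-step TD sketch).

    obs/next_obs are encoded observation batches; rewards/dones are float tensors.
    """
    values = critic(obs)                                   # V_phi(o_t)
    with torch.no_grad():                                  # target values are not backpropagated
        targets = rewards + gamma * (1.0 - dones) * target_critic(next_obs)
    return F.mse_loss(values, targets)

def sync_target(critic, target_critic):
    """Periodic hard update of the target critic to avoid oscillation during training."""
    target_critic.load_state_dict(critic.state_dict())
```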

3.5. Optimization Objective

GLSR offers an impressive action semantic encoder to help agents learn cooperation-aware action representation during the end-to-end training process. The overall objective function of the GLSR framework consists of the following three components:
()
where is the minimization objective of the empirical Bellman error, which encourages agents to learn their own policies via the value function based on the critic network, is the clipping PPO objective, and is the action semantic embedding loss, with β a trade-off parameter. In cooperation-aware action representation learning for more diverse behaviors, and help embed actions into a co-occurrence semantic space and explore the action semantic relations in the joint action space for cooperative agents.
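The following sketch combines the three terms of the overall objective in (12); the clipped surrogate follows the standard PPO form with the joint advantage, and the equal weighting of the actor and critic terms is an assumption since only β is specified.

```python
import torch

def actor_clip_loss(logp_new, logp_old, advantages, clip_param=0.2):
    """Standard clipped PPO surrogate using the GAE-based joint advantage."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    return -torch.min(unclipped, clipped).mean()

def glsr_total_loss(l_actor, l_critic, l_semantic, beta=1.0):
    """Overall objective: actor loss + critic loss + beta * semantic embedding loss."""
    return l_actor + l_critic + beta * l_semantic
```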

We describe the entire procedure in Algorithm 1. The cooperation-aware action representation learning is incorporated into the MARL procedure, which further improves the global rewards by guiding agents toward more diverse cooperative behaviors rather than simple reward maximization of some homogeneous behavior. For each episode, there are M parallel threads for training (see Table A2), with each thread collecting T timesteps of data. All hyperparameter settings can be found in Tables A1 and A2. Specifically, we provide the detailed calculation of the symmetric adjacency matrix. We first calculate the k-dimensional binary vector of co-occurring action types within all the agents at the current time step, where one denotes that the action occurs and zero denotes that it does not occur at that time step. Thus, we obtain a k × M matrix by stacking the M vectors that correspond to the M parallel threads for training. We compute the self-correlation of the matrix and divide its κ-th column by the number of times that the κ-th type of action occurs within all the parallel threads. We set the diagonal elements of the resulting matrix to zero to obtain the conditional probability matrix and finally calculate the symmetric adjacency matrix according to (2).

3.6. Extension of GLSR to CTDE Framework

For better scalability, we also introduce a CTDE variant of GLSR called GLSR-var. GLSR-var preserves the centralized critic network for global value estimation while deploying fully decentralized semantic-guided actor networks for each agent. These actor networks only work with partial observations and the alternative adjacency matrices are constructed by one-hot vectors for eliminating dependency on previous agents’ actions. The loss function for actor networks is defined as:
()
where denotes the local advantage estimation.
    Algorithm 1: GLSR for MARL.
    1. Input: Batch size b, number of agents n, parallel threads M, episodes K, steps per episode T.
    2. Initialize: Semantic encoding network φ0, actor network ϑ0, critic network ϕ0, replay buffer D.
    3. for k = 0, 1, 2, …, K − 1
    4.   for step t = 0, 1, 2, …, T − 1
    5.     /∗ Semantic-guided inference phase ∗/
    6.     for m = 0, 1, 2, …, n − 1
    7.       Count the occurrence number Ni (Nj) of each action type and the co-occurrence number Nij of different actions i and j at each time step in all parallel threads M.
    8.       Compute by the conditional probability Pr(li|lj) = Nij/Nj.
    9.       Infer in a sequential manner with and based on (3).
    10.      Collect a set of trajectories and insert them into D.
    11.    end
    12.    /∗ Semantic-guided training phase ∗/
    13.    Sample a random minibatch B from D.
    14.    Calculate the joint advantage function with GAE.
    15.    Input to generate E and the cooperation representation zi.
    16.    Input , , and to generate at once with the parallel training model.
    17.    Compute LSE(φ) with (4).
    18.    Estimate LActor(ϑ) and LCritic(ϕ) with (10) and (11).
    19.    Update the networks by minimizing the objectives in (12).
    20.  end
    21. end

4. Experiments

GLSR provides a collaborative learning framework for cooperative MAS tasks. The fundamental principle of GLSR is the cooperation-aware action representation learning incorporated into the multiagent trust region policy optimization procedure, which inherits the merit of monotonic performance improvement during training. The structure information in the action semantic space is graph-modeled by encoding the relations and dependencies among different action semantics, based on a highly efficient implementation from a sequence modeling perspective. To validate the cooperation performance of our proposed GLSR, we conduct a series of experiments on the SMAC benchmark [51], which features comparatively complicated scene settings and is one of the most commonly used benchmarks in the cooperative MARL field. Figure 3 displays various task scenarios in SMAC.

Figure 3. Demonstrations of the SMAC task scenarios. (a) MMM2, (b) 6h_vs_8z, (c) 27m_vs_30m, and (d) 3s5z_vs_3s6z.

We compare GLSR with three state-of-the-art methods, MAPPO, HAPPO, and MAT, which are based on the multiagent trust region optimization framework. During the experiments, the implementation of each baseline method was consistent with its official repository, and all hyperparameters were kept at their original best-performance settings. Specific parameter settings for each of the methods are contained in Appendix A.

4.1. Performance on SMAC

For the sake of generality, we first carried out algorithm performance verification experiments on hard and superhard maps, such as 3s_vs_5z, MMM2, 6h_vs_8z, 27m_vs_30m, and 3s5z_vs_3s6z. Figure 4 shows the performance comparison results on the 3s_vs_5z map, one of the hard maps with an asymmetric scenario. All the compared methods show similarly high win rates, but GLSR exhibits the fastest convergence and the lowest dead ratio, which indicates that more cooperative behaviors benefit not only a high win rate but also the reduction of allied losses.

Figure 4. Performance comparison between MAPPO/HAPPO/MAT and GLSR on map 3s_vs_5z.

We further consider several superhard maps, MMM2, 6h_vs_8z, 27m_vs_30m, and 3s5z_vs_3s6z, which offer more possibilities for diverse attempts at an optimal policy. In these asymmetric maps, the enemy outnumbers the allied forces by one or more units, especially in MMM2 and 3s5z_vs_3s6z, where each side contains more than one type of heterogeneous agent unit, so cooperative behavior may be more important for winning the battle than simple independent behavior with high reward. Figures 5(a), 5(c), 5(e), and 5(g) show that the state-of-the-art algorithms, such as HAPPO, MAPPO, and MAT, tend toward a suboptimal policy in which agents preferentially take homogeneous behaviors rather than exploring potentially heterogeneous behaviors. By contrast, GLSR significantly outperforms the other methods, as it provides semantic-relation-guided cooperation-aware action representation learning for MARL to realize heterogeneous exploration and avoid converging to a suboptimal joint policy. Cooperation-aware action rewards are received when individual agents are able to adjust their policies in accordance with cooperative semantics and contribute to the success of the team. The proposed GLSR explores more diverse action combinations instead of focusing on individual actions that optimize the current reward, which leads to more fluctuation in win rate before convergence than conventional methods. After convergence, the fluctuations become trivial, which demonstrates that GLSR can ensure consistent policy improvement from a global perspective.

Figure 5. Performance comparison between MAPPO/HAPPO/MAT and GLSR on different superhard map tasks of the SMAC environment.

To demonstrate the effectiveness of the proposed method in complex tasks, a statistical analysis of the win rate and standard deviation in comparison with MAT, MAPPO, HAPPO, and RODE is presented in Table 1. It is shown that GLSR outperforms the other methods. Although algorithms like RODE are also based on action space modeling, they tend to require more samples for exploration without exploiting the semantic information of action relations, which prevents them from performing better. By contrast, GLSR makes an endeavor to incorporate semantic relation encoding into the action representation, which facilitates its competitive performance in tasks that require complex cooperation.

Table 1. Comparison results on SMAC maps for different methods.
Map Difficulty MAT MAPPO HAPPO RODE GLSR Steps
3s5z Hard 100 (1.9) 96.9 (0.7) 90.0 (3.5) 93.8 (2.0) 100 (0.3) 1e7
3s_vs_5z Hard 95.1 (1.7) 100 (2.5) 91.9 (5.3) 78.9 (4.2) 100 (1.8) 1e7
5m_vs_6m Hard 89.5 (5.7) 89.1 (2.5) 73.8 (4.4) 71.1 (9.2) 91.6 (4.4) 1e7
6h_vs_8z Superhard 98.8 (1.3) 88.3 (3.7) 10.1 (1.4) 78.1 (37.0) 99.6 (0.9) 2e7
MMM2 Superhard 93.8 (2.6) 81.8 (10.1) 0.0 (0.0) 89.8 (6.7) 100 (0.0) 2e7
27m_vs_30m Superhard 96.1 (5.7) 93.8 (2.4) 53.0 (37.0) 96.8 (1.5) 100 (0.7) 1e7
  • Note: The bold values indicate the best performance.

4.2. Ablation Experiments and Analysis

We analyze the primary factors of the GLSR performance. Ablation results have been provided on two superhard maps 3s5z_vs_3s6z and 27m_vs_30m.

4.2.1. Analysis of the Collaborative Learning

We compare a two-stage model denoted as GLSR-ts, which sequentially encodes the action semantics and learns the cooperation-aware action representations in a two-stage manner instead of through collaborative learning. In other words, GLSR-ts first employs the graph encoder to integrate semantic relations into action semantic embeddings with the objective in (4). Then, GLSR-ts obtains cooperation-aware action latent representations and performs action prediction with the clipping PPO objective , while the obtained action semantic embeddings are frozen. As shown in Figure 6, the effectiveness of the collaborative learning is validated, especially in the modeling of more heterogeneous behaviors.

Figure 6. Respective contributions of each module in GLSR on 27m_vs_30m and 3s5z_vs_3s6z.

4.2.2. Analysis of the Semantic-Guided Interaction Based on Cross-Attention Mechanism

We also compare a simple interaction-based GLSR denoted as GLSR-cf, where the embedded actions and cooperation embeddings are concatenated and fused through a two-layer MLP instead of interacting through the cross-attention mechanism. As shown in Figure 6, GLSR significantly outperforms the GLSR-cf variant in terms of win rate.

4.2.3. Analysis of the Action Semantic Encoding

To validate the effectiveness of the action semantic encoding in GLSR, we compare two plain models denoted as GLSR-ge and GLSR-oe, which are implemented without action semantic encoding. They utilize action embeddings generated from a standard normal distribution and from one-hot vectors, respectively. The results in Figure 6 show the statistical effectiveness of the action semantic encoding. This is because the action embeddings of GLSR learn the structure of the action semantic space, so they can obtain more semantic relation information than Gaussian- or one-hot-based embeddings.

4.3. Training Data Utilization

A key component of policy iteration is the use of importance sampling for sample reuse, and for the sake of data integrity, a large batch size is commonly used together with training for tens of epochs. To gain deeper insight into how GLSR’s performance is affected by training data quantity, we consider different multiples of the batch size adopted in the initial results, represented as 0.5x, 1x, and 2x, where x is the product of episode length and rollout threads as shown in Tables A1 and A2 of Appendix A. The final win rates are displayed by the red bar clusters. The blue bar clusters show the total number of environment steps needed to achieve a high-level win rate (90% on 27m_vs_30m and 80% on 3s5z_vs_3s6z) as a measure of sample efficiency. From Figure 7, we observe that in superhard maps, larger data batches typically result in better convergence for GLSR, which benefits value function estimation and policy updating. However, an overabundance of data places a burden on computational resources and worsens sample efficiency. Therefore, we suggest using larger batches of data based on available computing resources to attain optimal performance, and then fine-tuning the batch size to maximize data efficiency.

Figure 7. Effect of batch size on GLSR.

4.4. Parameter Sensitivity

Figure 8 illustrates how the win rate of GLSR changes with the varying trade-off parameter β. The parameter β represents the impact of semantic relation learning on multiagent action prediction, especially in complex tasks that require diversified behavior exploration and more skillful coordination. Figure 8 shows the comparison for β between 0 and 2 on superhard SMAC tasks. When β is large, GLSR exhibits a remarkably rapid growing trend in win rate, which may benefit from the effective exploration of cooperative behavior via our action semantic encoding. In our experiments, we choose β = 1 for GLSR in various tasks.

Figure 8. Effect of the coefficient β on GLSR.

However, a too small β may hinder the framework from achieving a higher win rate, since the independence among the noncooperative action embeddings can hardly be guaranteed without the decoding objective in (4). As shown in Figure 8, the case of β = 0, which corresponds to GLSR without the action semantic embedding loss, suffers from severe performance degradation. This indicates that the decoding objective is effective in pulling the noncooperative action embeddings apart from each other, which is necessary for complex cooperation modeling.

4.5. Comparison Results in Distributed Scenario

For a fair comparison, we present statistical results of the win rate and standard deviation comparing GLSR-var with other CTDE-based methods such as MAPPO and MAT-Dec (the CTDE variant of MAT), which use a completely decentralized actor network for each agent. The performance results are shown in Table 2. Our GLSR-var outperforms the other CTDE-based methods, which is mainly attributed to the cooperation-aware representation.

Table 2. Comparison results on SMAC maps for CTDE-based methods.
Map Difficulty MAT-dec MAPPO GLSR-var Steps
3s5z Hard 100 (3.3) 96.9 (0.7) 100 (2.3) 1e7
3s_vs_5z Hard 100 (1.7) 100 (2.5) 100 (0.6) 1e7
5m_vs_6m Hard 83.1 (4.6) 89.1 (2.5) 86.1 (4.5) 1e7
6h_vs_8z Superhard 93.8 (4.7) 88.3 (3.7) 95.1 (3.7) 2e7
MMM2 Superhard 91.2 (5.3) 81.8 (10.1) 92 (6.3) 2e7
3s5z_vs_3s6z Superhard 85.3 (7.5) 74.3 (8.4) 89.3 (2.2) 2e7
  • Note: The bold values indicate the best performance.

5. Conclusion

We present the GLSR method within a collaborative learning framework in this paper, where the learning of action semantics and action latent representations can mutually guide and reinforce each other. GLSR exploits the structure of the action semantic space for the learning of intricate cooperative relationships among actions, which facilitates the cooperation-aware representation and policy optimization of MARL. The action semantic relations captured by the graph autoencoder are utilized to prompt the actor network to learn the cooperation-aware joint action representation, which implicitly guides agent cooperation in the joint policy space toward more diverse behaviors of cooperative agents. We have demonstrated that GLSR achieves highly competitive performance against well-established MARL methods. In the future, we will develop more sophisticated interactions between action semantics and latent representations, since identifying the most influential edge between two elements that contributes to the final performance may be significant.

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This work was supported by the Key Research and Development Program of Shaanxi (Program No. 2022ZDLGY03-02), the National Natural Science Foundation of China (Grant Nos. 62106134, 62476159), and Qinchuangyuan Project of “Scientists + Engineers” Team Construction in Shanxi Province (Grant Nos. 2022KXJ-035, 2023KXJ-286).


    Appendix A: Implementation Details of GLSR

    In this paper, two GIN layers are employed in GLSR to encode action semantic relations for semantic embeddings. Table A1 shows the implementation details. Specifically, the dimensionality of the cooperation embeddings, shown as “embedding size,” is used for the semantic-specific learnable mapping. “GLSR_lr” refers to the learning rate of the action semantic encoding network, and “weight decay” controls the complexity of graph learning. The maximal training data length of the action semantic encoding module should be set to cover the collected action data in the replay buffer.

    In our experiments, the parameter settings of the compared algorithms remain unchanged from the best-performance settings in their original repositories. The proposed method also adopts common practices including death masking, GAE with PopArt value normalization, and value clipping. The common hyperparameters adopted in the SMAC domain are listed in Table A2. The different hyperparameters used for the respective algorithms and tasks are listed in Table A3. In these tables, “episode length” stands for the number of environment steps over a trajectory collected at once, and “epoch number” represents the number of training passes over the sequence trajectory data in a “num mini-batch,” which is set to 1, especially in superhard maps. “Clip parameter” is the parameter for the clipping term in the loss function. “Num_env_steps” stands for the total number of environment steps over the whole task. “Use huber loss” indicates whether the Huber loss function is applied to the optimization procedure, and “huber delta” refers to δ in the Huber loss function. “GLSR_coef” represents β in equation (12), which is set to 1. “Lr” represents the learning rate of the critic and actor networks, and “gamma” refers to the discount factor for rewards. “GAE_lambda” describes the GAE lambda parameter for balancing variance and bias of the estimated value.

    Table A1. Hyperparameters used in the semantic relation graph learning module.
    Hyperparameters Value
    GLSR_lr 1e − 3
    Weight decay 1e − 5
    Embedding size 128
    Activation ReLU
    Hidden layer 1
    Max epoch length 800
    Table A2. Hyperparameters adopted in SMAC.
    Hyperparameters Value
    Actor 1
    GLSR 1
    lr 5e − 4
    Gamma 0.99
    gae_lambda 0.95
    Use huber loss True
    Activation ReLU
    Optimizer Adam
    Use PopArt True
    Huber delta 10
    Num mini-batch 1
    Eval episodes 32
    Training threads 1
    Max grad norm 10
    Rollout threads 32
    Table A3. Different hyperparameters used for GLSR, MAT, MAPPO, and HAPPO in hard (H) and superhard (SP) maps.
    Hyperparameters GLSR MAT MAPPO HAPPO
    H SP H SP H SP H SP
    Episode length 200 400 100 100 100 100 100 100
    Epoch number 5 5 5 15 5 15 5 15
    Clip parameter 0.2 0.2 0.2 0.05 0.2 0.05 0.2 0.05
    num_env_steps 1e7 2e7 1e7 2e7 1e7 2e7 1e7 2e7

    Data Availability Statement

    The data that support the findings of this study are available from the corresponding author upon reasonable request.
