Mastering Complex Coordination through Attention-based Dynamic Graph

Guangchong Zhou1,2, Zhiwei Xu1,2, Zeren Zhang1,2, and Guoliang Fan1

1 Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
Beijing, China
{zhouguangchong2021, xuzhiwei2019, zhangzeren2021, guoliang.fan}@ia.ac.cn

Abstract. Coordination between agents in multi-agent systems has become a popular topic in many fields. To capture the relationships between agents, graph structures have been combined with existing methods and have improved their results. But in large-scale tasks with numerous agents, an overly complex graph leads to a sharp rise in computational cost and a decline in performance. Here we present DAGMIX, a novel graph-based value factorization method. Instead of a complete graph, DAGMIX generates a dynamic graph at each time step during training, on which it realizes a more interpretable and effective mixing process through the attention mechanism. Experiments show that DAGMIX significantly outperforms previous SOTA methods in large-scale scenarios, while also achieving promising results on other tasks.

Keywords: Multi-Agent Reinforcement Learning · Coordination · Value Factorization · Dynamic Graph · Attention.

1 Introduction

Multi-agent systems (MAS) have become a popular research topic in the last few years, owing to rich application scenarios such as autonomous driving [16], cluster control [26], and game AI [22,1,29]. Communication constraints exist in many MAS settings, meaning that only local information is available for each agent's decision-making. A large body of research has studied multi-agent reinforcement learning (MARL), which combines MAS with reinforcement learning techniques. The simplest approach is to apply single-agent reinforcement learning to each agent directly, as independent Q-learning (IQL) [20] does, but it may not converge, since each agent faces a non-stationary environment caused by the other agents' learning and exploration. Alternatively, we can learn decentralized policies in a centralized fashion, known as the centralized training with decentralized execution (CTDE) [9] paradigm, which allows sharing of individual observations and global information during training but restricts agents to local information during execution.

In cooperative tasks, agents suffer from the credit assignment problem. If all agents share a joint value function, it is hard to measure each agent's contribution to the global value, and one agent may mistakenly regard others' credit as its own. Value factorization methods decompose the global value function into a mixture of the agents' individual value functions, thus addressing this problem at its root. Popular algorithms such as VDN [18], QMIX [14], and QTRAN [17] have demonstrated impressive performance in a range of cooperative tasks.

The above-mentioned algorithms focus on constructing the mixing function mathematically but ignore the physical topology of the agents. For example, players in a football team can form different formations such as '4-3-3' or '4-3-1-2' (shown in Figure 1), and a player has stronger links with nearer teammates. An intuitive idea is to use graph neural networks (GNNs) to extract the information contained in this structure.

Fig. 1. Topology of football players.
Multi-agent graph network (MAGNet) [11] generates a real-time graph to guide the agents' decisions based on historical and current global information. In QGNN [6], the model encodes every agent's trajectory with a gated recurrent unit (GRU), which is followed by a GNN to produce a local Q-value for each agent. These methods do improve on the original algorithms, but they violate the communication constraints, making them hard to extend to more general scenarios. Xu et al. modify QMIX and propose the multi-graph attention network (MGAN) [27], which constructs a graph over all agents and utilizes multiple graph convolutional networks (GCNs [5]) to jointly approximate the global value. Similarly, the Deep Implicit Coordination Graph (DICG) [7] processes individual Q-values through a sequence of GCNs instead of parallel ones. These algorithms fulfill CTDE and significantly outperform the benchmarks. However, they prefer fully connected graphs to sparse ones when building the graph. As the number of agents grows, the complexity of a complete graph increases quadratically, leading to huge computational costs and a decline in results.

In this paper, we propose a cooperative multi-agent reinforcement learning algorithm called Dynamic Attention-based Graph MIX (DAGMIX). DAGMIX offers a more reasonable and intuitive way of value mixing to estimate $Q_{tot}$ more accurately, thus providing better guidance for the learning of the agents. Specifically, at every time step during training, DAGMIX generates a partially connected graph rather than a fully connected one, according to the agents' attention to each other. We then perform graph attention on this dynamic graph to integrate the individual Q-values. Like other value factorization methods, DAGMIX fulfills the CTDE paradigm, and it also meets the monotonicity assumption, which ensures the consistency of the global optimum with the local optimal policies. Experiments on the StarCraft multi-agent challenge (SMAC) [15] show that our work is comparable to the baseline algorithms on most tasks, and DAGMIX significantly outperforms previous SOTA methods on some tasks, especially those with numerous agents and asymmetric environmental settings, which are considered super-hard scenarios.

2 Background

2.1 Dec-POMDP

A fully cooperative multi-agent environment is usually modeled as a decentralized partially observable Markov decision process (Dec-POMDP) [12], described by a tuple $G = \langle S, U, P, Z, r, O, n, \gamma \rangle$. At each time step, $s \in S$ is the current global state of the environment, but each agent $a \in A := \{1, \ldots, n\}$ receives only a unique local observation $z_a \in Z$ produced by the observation function $O(s, a) : S \times A \to Z$. Then every agent $a$ chooses an action $u_a \in U$, and the individual actions form the joint action $\mathbf{u} = [u_1, \ldots, u_n] \in \mathbf{U} \equiv U^n$. After the joint action $\mathbf{u}$ is applied in the current state $s$, the environment transitions to state $s'$ according to the state transition function $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \to [0, 1]$. All agents in a Dec-POMDP share the same global reward function $r(s, \mathbf{u}) : S \times \mathbf{U} \to \mathbb{R}$, and $\gamma \in [0, 1)$ is the discount factor.

In a Dec-POMDP, each agent $a$ chooses actions based on its own action-observation history $\tau_a \in T \equiv (Z \times U)^*$, so the policy of agent $a$ can be written as $\pi_a(u_a \mid \tau_a) : T \times U \to [0, 1]$. The joint action-value function is
$$Q^{\pi}(s_t, \mathbf{u}_t) = \mathbb{E}_{s_{t+1:\infty},\, \mathbf{u}_{t+1:\infty}}\left[R_t \mid s_t, \mathbf{u}_t\right],$$
where $\pi$ is the joint policy of all agents and the goal is to maximize the discounted return $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$.
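To make the objective concrete, here is a minimal numerical sketch of the discounted return defined above (the reward values are hypothetical and chosen only for illustration; only the formula itself comes from the paper):

```python
# Minimal illustration of the discounted return R_t = sum_{l>=0} gamma^l * r_{t+l}
# for a finite episode, with t = 0. The reward values below are made up.
def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** l) * r for l, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]     # shared team rewards, one per time step
print(discounted_return(rewards))       # 0.99**2 * 1.0 + 0.99**4 * 5.0 ≈ 5.783
```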
2.2 Value Factorization Methods

Credit assignment is a key problem in cooperative MARL. If all agents share a joint value function, it is hard for a single agent to tell how much it contributes to the global utility, and without such feedback learning easily fails. In value factorization methods, every agent has its own value function for decision-making, and the joint value function is regarded as an integration of the individual ones. To ensure that the optimal action of each agent is consistent with the global optimal joint action, all value decomposition methods comply with the Individual-Global-Max (IGM) [14] condition described below:
$$\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \begin{pmatrix} \arg\max_{u_1} Q_1(\tau_1, u_1) \\ \vdots \\ \arg\max_{u_n} Q_n(\tau_n, u_n) \end{pmatrix},$$
where $Q_{tot} = f(Q_1, \ldots, Q_n)$, $Q_1, \ldots, Q_n$ denote the individual Q-values, and $f$ is the mixing function.

For example, VDN assumes the function $f$ is a summation, while QMIX imposes a monotonicity constraint to meet the IGM condition:
$$\text{(VDN)} \quad Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau_a, u_a),$$
$$\text{(QMIX)} \quad \frac{\partial Q_{tot}(\boldsymbol{\tau}, \mathbf{u})}{\partial Q_a(\tau_a, u_a)} \geq 0, \quad \forall a \in \{1, \ldots, n\}.$$

2.3 Self-Attention

Consider a sequence of vectors $\boldsymbol{\alpha} = (a_1, \ldots, a_n)$; the motivation of self-attention is to capture the relevance between each pair of vectors [21]. An attention function can be described as mapping a query ($q$) and a set of key-value ($k$, $v$) pairs to an output ($o$), where the queries, keys, and values are all transformed from the original vectors. The output is derived as
$$w_{ij} = \operatorname{softmax}_j\left(s\left(q^i, k^j\right)\right) = \frac{\exp\left(s\left(q^i, k^j\right)\right)}{\sum_{t}\exp\left(s\left(q^i, k^t\right)\right)}, \qquad o^i = \sum_{j} w_{ij} v^j,$$
where $s(\cdot, \cdot)$ is a user-defined similarity function, usually the dot product. In practice, a multi-head implementation is usually employed to let the model compute attention from multiple perspectives.

2.4 Graph Neural Network

Graph neural networks (GNNs) are a class of deep learning methods designed to perform inference directly on data described by graphs, and they handle node-level, edge-level, and graph-level tasks easily. The main idea of a GNN is to aggregate a node's own information and its neighbors' information with a neural network. Given a graph $G = (V, E)$ and a $K$-layer GNN, the operation of the $k$-th layer can be formalized as [3]
$$h_v^{k} = \sigma\left(W_k \cdot \mathrm{AGG}\left(\left\{h_u^{k-1}, \forall u \in N(v)\right\},\, B_k h_v^{k-1}\right)\right),$$
where $h_v^{k}$ is the value of node $v$ at the $k$-th layer, $N$ is the neighborhood function returning the neighbors of a node, and $\mathrm{AGG}(\cdot)$ is a generalized aggregator, which can be a function such as mean, pooling, or an LSTM.

Unlike other deep networks, a GNN cannot improve its performance simply by deepening the model, since too many layers usually lead to severe over-smoothing or over-squashing. Besides, the computational complexity of a GNN is closely related to the number of nodes and edges. Fortunately, DAGMIX generates an optimized graph structure, making it well suited to applying a GNN.
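As a concrete illustration of the layer update above, the following is a minimal PyTorch sketch of a single GNN layer with a mean aggregator (our own simplification for illustration; the layer sizes and the choice of aggregator are arbitrary, and this is not the paper's code):

```python
import torch
import torch.nn as nn

class MeanAggGNNLayer(nn.Module):
    """One GNN layer in the spirit of the update above: aggregate the neighbours'
    embeddings (mean is used here as one possible choice of AGG) and combine them
    with the node's own transformed embedding."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)   # W_k, applied to the aggregated neighbours
        self.b = nn.Linear(in_dim, out_dim, bias=False)   # B_k, applied to the node itself

    def forward(self, h, adj):
        # h: (n_nodes, in_dim) node features; adj: (n_nodes, n_nodes) 0/1 adjacency matrix
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # avoid division by zero for isolated nodes
        neigh_mean = adj @ h / deg                        # mean over each node's neighbours
        return torch.relu(self.w(neigh_mean) + self.b(h))

# Usage: one layer on a random 5-node graph with 8-dimensional features.
h = torch.randn(5, 8)
adj = (torch.rand(5, 5) > 0.5).float()
print(MeanAggGNNLayer(8, 16)(h, adj).shape)               # torch.Size([5, 16])
```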
Fig. 2. The overall framework of DAGMIX. The individual Q network is shown in the blue box, and the mixing network is shown in the green box.

3 DAGMIX

In this section, we introduce our new method, DAGMIX. Compared to related studies, DAGMIX employs a dynamic graph in the training process of cooperative MARL. It fulfills the CTDE paradigm and the IGM condition, and provides a more precise and explainable estimation of the value of the agents' joint action, thus reaching better performance than previous methods. The overall architecture is shown in Figure 2.

3.1 Dynamic Graph Generation

Each agent corresponds to a node in the graph. Instead of relying on a fully connected graph, DAGMIX employs a dynamic graph whose structure is generated in real time at each time step. Inspired by G2ANet [8], we adopt a hard-attention mechanism: it computes attention weights between all pairs of nodes and then sets each weight to 0 or 1, indicating whether the corresponding pair of nodes is connected in the dynamic graph. Unlike G2ANet, however, DAGMIX removes the communication between agents and enables fully decentralized decision-making.

Assume that we take an episode from the replay buffer. First, we encode the observation $o_a^t$ of each agent $a$ at each time step $t$ as the embedding $x_a^t$. Then we perform a pair-wise concatenation on all the embedding vectors, which yields a matrix $\mathbf{x}^t_{n \times n}$ with entries $x_{ij}^t = \left(x_i^t, x_j^t\right)$, $i, j = 1, \ldots, n$. This matrix $\mathbf{x}^t$ can be regarded as a batch of sequential data: it has a batch size of $n$ corresponding to the $n$ agents, and every agent $a$ has a sequence from $\left(x_a^t, x_1^t\right)$ to $\left(x_a^t, x_n^t\right)$. It is therefore natural to use a sequential model such as a GRU to process every sequence and compute the attention between agents. Notably, the output of a standard GRU depends only on the current and previous inputs and ignores the subsequent ones, so an agent at the front of the sequence has a greater influence on the output than an agent at the end, making the order of the agents crucial. Such order dependence is clearly unjustified, so we adopt a bidirectional GRU (Bi-GRU) instead. The calculation on the concatenation $\left(x_i^t, x_j^t\right)$ is formalized in Equation (1), where $i, j = 1, \ldots, n$ and $f(\cdot)$ is a fully connected layer that embeds the output of the GRU:
$$a_{i,j} = f\left(\text{Bi-GRU}\left(x_i, x_j\right)\right). \tag{1}$$

Next we need to decide whether there is a link between agents $i$ and $j$. One approach is to sample 0 or 1 based on $a_{i,j}$, but plain sampling is not differentiable and would make the graph generation module untrainable. To preserve the gradient flow, DAGMIX adopts the Gumbel-softmax [4]:
$$\mathrm{gum}(x) = \operatorname{softmax}\left(\left(x + \log\left(\lambda e^{-\lambda x}\right)\right) / \tau\right), \tag{2}$$
where $\mathrm{gum}(\cdot)$ denotes the Gumbel-softmax function, $\lambda$ is the hyperparameter of the exponential distribution, and $\tau$ is the temperature hyperparameter. As the temperature decreases, the output of the Gumbel-softmax gets closer to one-hot. In DAGMIX, the Gumbel-softmax returns a vector of length 2. Combining Equation (1) and Equation (2) gives Equation (3):
$$\left(\,\cdot\,, A_{i,j}\right)^{\mathsf{T}} = \mathrm{gum}\left(f\left(\text{Bi-GRU}\left(x_i, x_j\right)\right)\right), \tag{3}$$
where $A$ is the adjacency matrix whose element $A_{i,j}$ is the second entry of the Gumbel-softmax output, indicating whether there is an edge from agent $j$ to agent $i$. Finally, we obtain a graph structure similar to the one illustrated on the right side of Figure 2. It is characterized by a sparse connectivity pattern, and the sparsity grows as the graph scales up.
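A minimal PyTorch sketch of the graph generation module described in this subsection (our own illustrative rendering, not the authors' released code; the off-the-shelf straight-through Gumbel-softmax is used in place of Equation (2), and all layer sizes are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphGenerator(nn.Module):
    """Hard attention over agent pairs: pair-wise concat -> Bi-GRU -> FC -> Gumbel-softmax."""
    def __init__(self, obs_dim, emb_dim=32, hidden=32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, emb_dim)                     # o_a^t -> x_a^t
        self.bigru = nn.GRU(2 * emb_dim, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 2)                             # f(.): two logits per pair

    def forward(self, obs, tau=0.5):
        # obs: (n_agents, obs_dim) observations at one time step
        n = obs.size(0)
        x = self.encoder(obs)                                          # (n, emb)
        pairs = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                           x.unsqueeze(0).expand(n, n, -1)], dim=-1)   # entry (i, j) = (x_i, x_j)
        out, _ = self.bigru(pairs)                                     # row i is a sequence over j
        logits = self.fc(out)                                          # (n, n, 2)
        # Straight-through Gumbel-softmax keeps the sampling step differentiable;
        # the second channel plays the role of A_{i,j}.
        edges = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]   # (n, n) adjacency matrix A
        return edges

# usage: adjacency matrix for 4 agents with 10-dimensional observations
gen = DynamicGraphGenerator(obs_dim=10)
adj = gen(torch.randn(4, 10))
```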
3.2 Value Mixing Network

Given the generated graph structure (shown on the right of Figure 2), we design an attention-based value integration on the dynamic graph, which can be viewed as performing self-attention on the original observations. More specifically, the observation is processed through two independent channels. On one hand, the agent's Q network takes the current observation as input and outputs an individual Q-value estimate, which serves as the value in self-attention. On the other hand, the observation is encoded by an MLP in the mixing network, followed by two fully connected layers representing $W_q$ and $W_k$. The query and key of agent $a$ are derived as $q_a = W_q x_a$ and $k_a = W_k x_a$, respectively.

Moreover, restricted by the dynamic graph, we only compute attention scores within agent $a$'s neighborhood $N(a) = \{i \in A \mid A_{a,i} = 1\}$, which can be regarded as a mask operation based on the adjacency matrix $A$. The calculation in matrix form is
$$Q = X W_q, \quad K = X W_k, \quad V = \left(Q_1, \ldots, Q_n\right)^{\mathsf{T}},$$
$$\left(Q_1', \ldots, Q_n'\right)^{\mathsf{T}} = \operatorname{softmax}\left(\operatorname{Mask}\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_k}}\right)\right) V, \tag{4}$$
where $X = (x_1, \ldots, x_n)^{\mathsf{T}}$ and $d_k$ is the hidden dimension of $W_k$. Through Equation (4), DAGMIX blends the original Q-values of all agents and transforms them into $\left(Q_1', \ldots, Q_n'\right)$.

In the last step, we combine $\left(Q_1', \ldots, Q_n'\right)$ into a global $Q_{tot}$. As the dotted black box in Figure 2 shows, we adopt a QMIX-style mixing module, which lets DAGMIX benefit from the global information. The hypernetworks [2] take the global state $s_t$ as input and output the weights $\mathbf{w}$ and bias $b$, so $Q_{tot}$ is calculated as
$$Q_{tot} = \left(Q_1', \ldots, Q_n'\right)\mathbf{w} + b. \tag{5}$$
Taking absolute values guarantees that the weights in $\mathbf{w}$ are non-negative, and in Equation (4) the coefficient of every $Q_a$ is also non-negative thanks to the softmax function. Equation (6) therefore shows that DAGMIX satisfies the IGM condition:
$$\frac{\partial Q_{tot}}{\partial Q_a} = \sum_{i=1}^{n} \frac{\partial Q_{tot}}{\partial Q_i'} \frac{\partial Q_i'}{\partial Q_a} \geq 0, \quad \forall a \in \{1, \ldots, n\}. \tag{6}$$

It is noteworthy that DAGMIX is not a specific algorithm but a framework. As depicted by the green box in Figure 2, the part in the red dotted box can be replaced by any kind of GNN (e.g., GCN, GraphSAGE [3]), while the mixing module in the black dotted box can be substituted with any value factorization method. DAGMIX enhances the performance of the original algorithm, especially on large-scale problems.

3.3 Loss Function

Like other value factorization methods, DAGMIX is trained end-to-end. The loss is the TD error, as in value-based single-agent reinforcement learning algorithms [19]. Denoting the parameters of all neural networks by $\theta$, we optimize them by minimizing
$$\mathcal{L}(\theta) = \left(y^{tot} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u} \mid \theta)\right)^2, \tag{7}$$
where $y^{tot} = r + \gamma \max_{\mathbf{u}'} Q_{tot}\left(\boldsymbol{\tau}', \mathbf{u}' \mid \theta^-\right)$ is the target joint action-value and $\theta^-$ denotes the parameters of the target network.
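To make Equations (4), (5), and (7) concrete, here is a minimal PyTorch sketch of the masked-attention mixing step and the TD loss (our own simplified rendering of the method: each hypernetwork is reduced to a single linear layer, shapes assume a single time step, and all names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAGMixer(nn.Module):
    """Masked self-attention over individual Q-values (Eq. 4) + QMIX-style mixing (Eq. 5)."""
    def __init__(self, obs_dim, state_dim, n_agents, d_k=32):
        super().__init__()
        self.d_k = d_k
        self.embed = nn.Linear(obs_dim, d_k)             # MLP encoding of observations (simplified)
        self.w_q = nn.Linear(d_k, d_k, bias=False)
        self.w_k = nn.Linear(d_k, d_k, bias=False)
        self.hyper_w = nn.Linear(state_dim, n_agents)    # hypernetwork for w (single layer here)
        self.hyper_b = nn.Linear(state_dim, 1)           # hypernetwork for b

    def forward(self, q_values, obs, state, adj):
        # q_values: (n,), obs: (n, obs_dim), state: (state_dim,), adj: (n, n) 0/1 adjacency matrix
        # (assumes adj keeps at least one neighbour per agent, e.g. a self-loop, so no row is all -inf)
        x = self.embed(obs)
        q, k = self.w_q(x), self.w_k(x)
        scores = q @ k.t() / self.d_k ** 0.5
        scores = scores.masked_fill(adj == 0, float('-inf'))   # Mask(.): attend only to neighbours
        attn = F.softmax(scores, dim=-1)                       # non-negative coefficients
        q_prime = attn @ q_values.unsqueeze(-1)                # (n, 1), Eq. (4)
        w = torch.abs(self.hyper_w(state))                     # |.| keeps mixing weights non-negative
        b = self.hyper_b(state)
        return (q_prime.squeeze(-1) * w).sum() + b             # Q_tot, Eq. (5)

def td_loss(mixer, target_mixer, q_taken, q_next_max, obs, next_obs,
            state, next_state, adj, next_adj, reward, gamma=0.99):
    """TD loss of Eq. (7); q_next_max comes from the target agent networks."""
    q_tot = mixer(q_taken, obs, state, adj)
    with torch.no_grad():
        y_tot = reward + gamma * target_mixer(q_next_max, next_obs, next_state, next_adj)
    return (y_tot - q_tot) ** 2
```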
The training procedure is summarized in Algorithm 1.

Algorithm 1: DAGMIX
    Initialize replay buffer D
    Initialize [Q_a] and Q_tot with random parameters θ; initialize target parameters θ⁻ = θ
    while training do
        for episode ← 1 to M do
            Start with initial state s^0 and each agent's observation o_a^0 = O(s^0, a)
            Initialize an empty episode recorder E
            for t ← 0 to T do
                For every agent a, with probability ε select action u_a^t randomly;
                otherwise select u_a^t = argmax_{u_a^t} Q_a(τ_a^t, u_a^t)
                Take the joint action u^t, and retrieve the next state s^{t+1}, next observations o^{t+1}, and reward r^t
                Store the transition (s^t, o^t, u^t, r^t, s^{t+1}, o^{t+1}) in E
            end
            Store the episode data E in D
        end
        Sample a random mini-batch B of size N from D
        for t ← 0 to T − 1 do
            Extract the transition (s^t, o^t, u^t, r^t, s^{t+1}, o^{t+1}) from B
            For every agent a, calculate Q_a(τ_a^t, u_a^t | θ)
            Generate the dynamic graph G^t from o^t based on Equation (3)
            According to the structure of G^t, calculate Q_tot(τ^t, u^t | θ) based on Equations (4) and (5)
            With the target network, calculate Q_a(τ_a^{t+1}, u_a^{t+1} | θ⁻) = max Q_a(τ_a^{t+1}, · | θ⁻)
            Calculate Q_tot(τ^{t+1}, u^{t+1} | θ⁻)
        end
        Update θ by minimizing the total loss in Equation (7)
        Update the target network parameters θ⁻ = θ periodically
    end

4 Experiments

4.1 Settings

SMAC is an environment for research on collaborative multi-agent reinforcement learning based on Blizzard's StarCraft II RTS game. It consists of a set of StarCraft II micro scenarios that evaluate how well independent agents can learn coordination to solve complex tasks. The StarCraft II version is 4.6.2 (B69232) in our experiments; note that results obtained with different client versions are not always comparable. The difficulty of the game AI is set to very hard (level 7). To conquer the wide range of challenges in SMAC, algorithms must adapt to different scenarios and perform well in both single-unit control and group coordination. SMAC has recently become increasingly popular for its ability to evaluate MARL algorithms comprehensively.

Detailed information on the challenges used in our experiments is given in Table 1. All the selected challenges involve a large number of agents, and some of them are asymmetric or heterogeneous, making them extremely hard for MARL algorithms to handle.

Our experiments are based on Pymarl [15]. To judge the performance of DAGMIX objectively, we adopt several of the most popular value factorization methods (VDN, QMIX, and QTRAN) as well as more recent methods including Qatten [28], QPLEX [23], W-QMIX [13], MAVEN [10], ROMA [24], and RODE [25] as baselines. The hyperparameters of the baseline algorithms are set to the defaults in Pymarl.

Table 1. Information of selected challenges.

Challenge      Ally Units                          Enemy Units                         Level of Difficulty
2s3z           2 Stalkers, 3 Zealots               2 Stalkers, 3 Zealots               Easy
3s5z           3 Stalkers, 5 Zealots               3 Stalkers, 5 Zealots               Easy
1c3s5z         1 Colossus, 3 Stalkers, 5 Zealots   1 Colossus, 3 Stalkers, 5 Zealots   Easy
8m_vs_9m       8 Marines                           9 Marines                           Hard
MMM2           1 Medivac, 2 Marauders, 7 Marines   1 Medivac, 3 Marauders, 8 Marines   Super Hard
bane_vs_bane   4 Banelings, 20 Zerglings           4 Banelings, 20 Zerglings           Hard
25m            25 Marines                          25 Marines                          Hard
27m_vs_30m     27 Marines                          30 Marines                          Super Hard
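For reference, the challenge names in Table 1 are SMAC map names. A minimal interaction loop with one of these scenarios looks roughly as follows (based on the public smac package; treat the exact constructor arguments as indicative rather than authoritative):

```python
from smac.env import StarCraft2Env
import numpy as np

# Instantiate one scenario from Table 1; difficulty "7" corresponds to "very hard".
env = StarCraft2Env(map_name="27m_vs_30m", difficulty="7")
info = env.get_env_info()
n_agents, n_actions = info["n_agents"], info["n_actions"]

env.reset()
terminated, episode_return = False, 0.0
while not terminated:
    # Each agent picks a random available action; a MARL algorithm would use Q_a(tau_a, .) here.
    actions = []
    for a in range(n_agents):
        avail = env.get_avail_agent_actions(a)
        actions.append(np.random.choice(np.nonzero(avail)[0]))
    reward, terminated, _ = env.step(actions)   # shared team reward, as in the Dec-POMDP setting
    episode_return += reward
env.close()
print("episode return:", episode_return)
```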
4.2 Validation

On every challenge, we run each algorithm for 2 million steps, repeated 5 times with different random seeds, and record the evolution of the win ratio. To reduce randomness in the evaluation, we take the average win ratio of the most recent 15 tests as the current result. The performance of DAGMIX and the baselines is shown in Figure 3, where the solid line is the median win ratio over the five runs of the corresponding algorithm and the shaded area covers the 25-75% percentiles. Detailed win ratio data are listed in Table 2. Among the baselines, QMIX is recognized as the SOTA method and is the main competitor of DAGMIX due to its stability and effectiveness on most tasks.

Fig. 3. Overall results in different challenges: (a) 1c3s5z, (b) 8m_vs_9m, (c) MMM2, (d) bane_vs_bane, (e) 25m, (f) 27m_vs_30m.

Table 2. Median test win ratio (%) in different scenarios.

Scenario       DAGMIX   VDN     QMIX    QTRAN
1c3s5z         93.75    96.86   94.17   11.67
8m_vs_9m       92.71    94.17   94.17   70.63
MMM2           86.04    4.79    45.42   0.63
bane_vs_bane   100      94.79   99.79   99.58
25m            99.79    87.08   99.79   35.20
27m_vs_30m     49.58    9.58    33.33   18.33

It can be clearly observed that the performance of DAGMIX is very close to the best baseline on relatively easy challenges such as 1c3s5z and 8m_vs_9m. As the scenarios become more asymmetric and complex, DAGMIX begins to show its strength and exceeds the baselines. In bane_vs_bane and 25m, the win ratio curve of DAGMIX lies to the left of QMIX's, indicating faster convergence. MMM2 and 27m_vs_30m are classified as super-hard scenarios, on which most algorithms perform very poorly due to the huge number of agents and the significant disparity between the opponent's troops and ours. However, Figures 3(c) and 3(f) show that our method makes tremendous progress, as DAGMIX achieves unprecedented win ratios on these two tasks. Notably, the variance of DAGMIX's training results is relatively small, indicating robustness against random perturbations.

4.3 Ablation

We conducted ablation experiments to demonstrate that the dynamic graph generated by the hard-attention mechanism is key to the success of DAGMIX.

Fig. 4. Ablation experiment results: (a) 2s3z, (b) 8m_vs_9m, (c) MMM2, (d) 27m_vs_30m.

First, since many studies have improved models simply by replacing a fully connected layer with an attention layer, we ask whether DAGMIX merely benefits from such a trick; we therefore select Qatten [28] as a control group, which also aggregates individual Q-values through an attention mechanism but lacks a graph structure. Second, to confirm the effectiveness of the sparse dynamic graph over a fully connected one, we set all entries of the adjacency matrix A to 1, which we refer to as Fully Connected Graph MIX (FCGMIX), for comparison.

From Figure 4 we find that Qatten attains a disastrous win ratio with high variance, indicating that simply adding an attention layer does not necessarily improve performance. When there are fewer agents, the fully connected graph is more comprehensive and does not have much higher complexity than the dynamic graph, so on 2s3z FCGMIX performs even slightly better than DAGMIX. However, as the number of agents increases and the complete graph becomes more complex, DAGMIX gains a clear advantage thanks to its optimized graph structure.

As noted in Section 3.2, DAGMIX is a framework that enhances its base algorithm on large-scale problems. Here we combine VDN with our dynamic graph, denoted DAGVDN, to investigate the improvement over VDN, just as DAGMIX improves over QMIX. The results are presented in Figure 5.
Similar to DAGMIX, DAGVDN outperforms VDN slightly on relatively easy tasks, while it shows significant superiority in more complex scenarios that VDN cannot handle.

Fig. 5. The results of DAGMIX and DAGVDN compared to the original algorithms: (a) 3s5z, (b) 1c3s5z, (c) 8m_vs_9m, (d) MMM2.

5 Conclusion

In this paper, we propose DAGMIX, a cooperative MARL algorithm based on value factorization. It combines the individual Q-values of all agents through operations on a dynamic graph generated in real time, providing a more interpretable and precise estimation of the global value to guide the training process. DAGMIX captures the intrinsic relationships between agents and allows end-to-end learning of decentralized policies in a centralized manner. Experiments on SMAC show the prominent superiority of DAGMIX on large-scale and asymmetric problems. In other scenarios, DAGMIX still demonstrates very stable performance comparable to the best baselines. We believe DAGMIX provides a reliable solution to multi-agent coordination tasks.

There is still room for improvement. First, it seems practical to apply multi-head attention in a later implementation. Besides, the dynamic graph in DAGMIX is not always superior to the complete graph on problems with fewer agents. Building on the promising findings presented in this paper, work on these remaining issues is continuing and will be presented in future papers.

Acknowledgements

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences, Grant No. XDA27050100.

References

1. Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al.: Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019)
2. Ha, D., Dai, A., Le, Q.V.: Hypernetworks. arXiv preprint arXiv:1609.09106 (2016)
3. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017)
4. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)
5. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
6. Kortvelesy, R., Prorok, A.: QGNN: Value function factorisation with graph neural networks. arXiv preprint arXiv:2205.13005 (2022)
7. Li, S., Gupta, J.K., Morales, P., Allen, R., Kochenderfer, M.J.: Deep implicit coordination graphs for multi-agent reinforcement learning. arXiv preprint arXiv:2006.11438 (2020)
8. Liu, Y., Wang, W., Hu, Y., Hao, J., Chen, X., Gao, Y.: Multi-agent game abstraction via graph attention neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 7211–7218 (2020)
9. Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems 30 (2017)
10. Mahajan, A., Rashid, T., Samvelyan, M., Whiteson, S.: MAVEN: Multi-agent variational exploration. Advances in Neural Information Processing Systems 32 (2019)
11. Malysheva, A., Kudenko, D., Shpilman, A.: MAGNet: Multi-agent graph network for deep multi-agent reinforcement learning. In: 2019 XVI International Symposium "Problems of Redundancy in Information and Control Systems" (REDUNDANCY), pp. 171–176. IEEE (2019)
12. Oliehoek, F.A., Amato, C.: A Concise Introduction to Decentralized POMDPs. Springer (2016)
13. Rashid, T., Farquhar, G., Peng, B., Whiteson, S.: Weighted QMIX: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in Neural Information Processing Systems 33, 10199–10210 (2020)
14. Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., Whiteson, S.: QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In: International Conference on Machine Learning, pp. 4295–4304. PMLR (2018)
15. Samvelyan, M., Rashid, T., De Witt, C.S., Farquhar, G., Nardelli, N., Rudner, T.G., Hung, C.M., Torr, P.H., Foerster, J., Whiteson, S.: The StarCraft multi-agent challenge. arXiv preprint arXiv:1902.04043 (2019)
16. Shamsoshoara, A., Khaledi, M., Afghah, F., Razi, A., Ashdown, J.: Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning. In: 2019 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 1–6. IEEE (2019)
17. Son, K., Kim, D., Kang, W.J., Hostallero, D., Yi, Y.: QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408 (2019)
18. Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W.M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J.Z., Tuyls, K., et al.: Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296 (2017)
19. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (2018)
20. Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J., Vicente, R.: Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 12(4), e0172395 (2017)
21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
22. Vinyals, O., Babuschkin, I., Czarnecki, W.M., Mathieu, M., Dudzik, A., Chung, J., Choi, D.H., Powell, R., Ewalds, T., Georgiev, P., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
23. Wang, J., Ren, Z., Liu, T., Yu, Y., Zhang, C.: QPLEX: Duplex dueling multi-agent Q-learning. arXiv preprint arXiv:2008.01062 (2020)
24. Wang, T., Dong, H., Lesser, V., Zhang, C.: ROMA: Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039 (2020)
25. Wang, T., Gupta, T., Mahajan, A., Peng, B., Whiteson, S., Zhang, C.: RODE: Learning roles to decompose multi-agent tasks. arXiv preprint arXiv:2010.01523 (2020)
26. Xu, D., Chen, G.: Autonomous and cooperative control of UAV cluster with multi-agent reinforcement learning. The Aeronautical Journal 126(1300), 932–951 (2022)
27. Xu, Z., Zhang, B., Bai, Y., Li, D., Fan, G.: Learning to coordinate via multiple graph neural networks. In: International Conference on Neural Information Processing, pp. 52–63. Springer (2021)
28. Yang, Y., Hao, J., Liao, B., Shao, K., Chen, G., Liu, W., Tang, H.: Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv preprint arXiv:2002.03939 (2020)
29. Ye, D., Chen, G., Zhang, W., Chen, S., Yuan, B., Liu, B., Chen, J., Liu, Z., Qiu, F., Yu, H., et al.: Towards playing full MOBA games with deep reinforcement learning. Advances in Neural Information Processing Systems 33, 621–632 (2020)