Abstract: When training an agent with Reinforcement Learning, an efficient exploration strategy is essential to achieve good results. Multi-Agent Reinforcement Learning introduces additional challenges that demand efficient exploration in order to find a set of policies capable of achieving the goal, because agents may depend on each other to succeed. State-of-the-art works combine exploitation and exploration behaviour into a single policy. Instead, we propose a dual-policy architecture, in which the exploration policy is kept separate from the exploitation policy. We present two ways to realise such an architecture: Weighted-Q Dual-Policy (WQ-DP) and ϵ-Sampled Dual-Policy (ϵS-DP). WQ-DP follows an approach closer to previous works, choosing an action through a weighted sum of the Q-values produced by the exploitation and exploration policies. ϵS-DP samples between the exploitation and exploration policies based on an ϵ parameter that varies during training. Our results show that agents using a dual-policy architecture outperform agents that combine exploitation and exploration into a single policy, with ϵS-DP achieving the best results among the tested architectures. Further experiments show that the policy sampling period in ϵS-DP greatly contributes to its superior performance.
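The abstract describes both action-selection schemes only at a high level; the sketch below is a minimal, hypothetical illustration of how they could be instantiated for tabular or per-state Q-value vectors. The function names, the weight `w`, and the use of NumPy arrays are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed, not the paper's code) of the two dual-policy
# action-selection schemes described in the abstract.
import numpy as np

rng = np.random.default_rng(0)

def wq_dp_action(q_exploit, q_explore, w):
    """Weighted-Q Dual-Policy: act greedily on a weighted sum of the
    Q-values produced by the exploitation and exploration policies."""
    return int(np.argmax(w * q_exploit + (1.0 - w) * q_explore))

def eps_dp_action(q_exploit, q_explore, eps):
    """epsilon-Sampled Dual-Policy: with probability eps follow the
    exploration policy's Q-values, otherwise the exploitation policy's."""
    q_values = q_explore if rng.random() < eps else q_exploit
    return int(np.argmax(q_values))

# Toy usage: Q-values over 4 actions for the current state, one vector per policy.
q_exploit = np.array([0.2, 0.9, 0.1, 0.4])
q_explore = np.array([0.7, 0.1, 0.8, 0.3])
print(wq_dp_action(q_exploit, q_explore, w=0.5))    # blend both policies
print(eps_dp_action(q_exploit, q_explore, eps=0.3)) # sample one policy to follow
```

In this reading, ϵ would typically be annealed over training so that the agent relies on the exploration policy early on and on the exploitation policy later.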
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Pablo_Samuel_Castro1
Submission Number: 4130