Keywords: Multi-Agent Reinforcement Learning, Opponent Modeling, Multi-Armed Bandit, Imperfect Information, Noise
Abstract: Opponent Modeling (OM) is a powerful framework in Multi-Agent Reinforcement Learning (MARL) for anticipating and adapting to the strategies of other agents.
However, its success depends heavily on the assumption that observations are of high quality.
In many real-world applications, agents must operate under imperfect information that can lead to inaccurate model representations.
In this paper, we investigate the drawbacks of conditioning an agent's policy on a flawed opponent model, which can cause significant performance degradation relative to model-agnostic baselines.
To address this, we introduce Strategy Weighting for Adaptive Policies (SWAP), a novel adaptive framework that treats strategy selection as an online learning problem.
Employing the EXP4 algorithm, our agent treats a predictive OM-based policy and a robust conservative policy as competing experts, dynamically switching between them based on their observed performance.
Our experimental results demonstrate the advantage of adopting a conservative approach when information is flawed and of using predictive modeling when information is reliable, with SWAP outperforming state-of-the-art methods in these critical scenarios.
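The expert-selection idea described above can be illustrated with a minimal sketch of the textbook EXP4 update over two experts. This is not the authors' implementation: the class name `Exp4`, the helper `run_demo`, the exploration rate `gamma`, the one-hot expert advice, and the fixed toy rewards are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


class Exp4:
    """Textbook EXP4: exponential weights over experts, bandit feedback."""

    def __init__(self, n_experts, n_actions, gamma=0.1, rng=None):
        self.n_actions = n_actions
        self.gamma = gamma                 # exploration rate (assumed value)
        self.log_w = [0.0] * n_experts     # log-weights for numerical stability
        self.rng = rng or random.Random(0)

    def expert_weights(self):
        """Normalized weight of each expert."""
        m = max(self.log_w)
        w = [math.exp(l - m) for l in self.log_w]
        total = sum(w)
        return [x / total for x in w]

    def action_probs(self, advice):
        """Mix the experts' action distributions, plus uniform exploration."""
        q = self.expert_weights()
        k = self.n_actions
        return [
            (1 - self.gamma) * sum(q[i] * advice[i][a] for i in range(len(q)))
            + self.gamma / k
            for a in range(k)
        ]

    def act(self, advice):
        probs = self.action_probs(advice)
        action = self.rng.choices(range(self.n_actions), weights=probs)[0]
        return action, probs

    def update(self, advice, action, probs, reward):
        """Importance-weighted update; reward is assumed to lie in [0, 1]."""
        est = reward / probs[action]       # unbiased estimate for the played action
        for i, xi in enumerate(advice):
            self.log_w[i] += self.gamma * xi[action] * est / self.n_actions


def run_demo(rounds=500, seed=0):
    """Toy setting (hypothetical): expert 0 stands in for the OM-based policy,
    expert 1 for the conservative policy. The opponent model is assumed to be
    misleading, so the action it recommends pays less."""
    bandit = Exp4(n_experts=2, n_actions=2, gamma=0.1, rng=random.Random(seed))
    advice = [[1.0, 0.0],   # predictive expert always recommends action 0
              [0.0, 1.0]]   # conservative expert always recommends action 1
    rewards = [0.2, 0.6]    # fixed toy rewards per action
    for _ in range(rounds):
        action, probs = bandit.act(advice)
        bandit.update(advice, action, probs, rewards[action])
    return bandit.expert_weights()
```

Calling `run_demo()` returns the normalized weights of the two experts. In this toy setting the weight shifts toward the conservative expert because its recommended action earns more reward; swapping the entries of `rewards` would model the case where the opponent model is reliable.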
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Paper Type: Standard paper
Submission Number: 44