A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Blind Submission | Readers: Everyone
Keywords: Multiagent reinforcement learning, Meta-learning, Non-stationarity
Abstract: A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other agents that are also learning simultaneously. In particular, each agent perceives the environment as effectively non-stationary because the policies of the other agents are changing. Moreover, each agent is itself constantly learning, leading to natural non-stationarity in the distribution of experiences it encounters. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to these multiagent settings. This is achieved by modeling our gradient updates to consider both an agent’s own non-stationary policy dynamics and the non-stationary policy dynamics of the other agents interacting with it in the environment. We find that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently combines key aspects of previous state-of-the-art approaches on this topic. We test our method on several multiagent benchmarks and demonstrate that it adapts more efficiently than previous related approaches to new agents as they learn, across the spectrum of mixed-incentive, competitive, and cooperative environments.
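To make the central idea concrete, the sketch below illustrates (in JAX) a meta-gradient that is taken through one anticipated learning step of both agents, so it reflects the agent's own policy dynamics and the peer's non-stationary learning response. This is a minimal illustration of that general idea, not the authors' exact theorem or algorithm; the 2x2 matrix-game payoffs, the single-logit policy parameterization, and the learning rates are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's algorithm):
# differentiate agent 1's return AFTER one learning step of both agents
# with respect to agent 1's current parameters, so the meta-gradient
# accounts for both agents' changing (non-stationary) policies.
import jax
import jax.numpy as jnp

# Illustrative 2x2 matrix-game payoffs (rows: agent 1's action, cols: agent 2's).
R1 = jnp.array([[3., 0.], [5., 1.]])
R2 = jnp.array([[3., 5.], [0., 1.]])

def policy(theta):
    """Map a single logit to action probabilities via a softmax."""
    return jax.nn.softmax(jnp.array([theta, 0.0]))

def expected_return(theta1, theta2, payoff):
    """Expected payoff under the two agents' independent stochastic policies."""
    return policy(theta1) @ payoff @ policy(theta2)

def inner_update_agent1(theta1, theta2, lr=1.0):
    """One naive gradient-ascent step on agent 1's own expected return."""
    g = jax.grad(expected_return, argnums=0)(theta1, theta2, R1)
    return theta1 + lr * g

def inner_update_agent2(theta1, theta2, lr=1.0):
    """One naive gradient-ascent step on agent 2's own expected return."""
    g = jax.grad(expected_return, argnums=1)(theta1, theta2, R2)
    return theta2 + lr * g

def meta_objective(theta1, theta2):
    """Agent 1's return after BOTH agents take one learning step.

    Differentiating this w.r.t. theta1 gives a meta-gradient that flows
    through agent 1's own update and through the peer's learning step.
    """
    theta1_next = inner_update_agent1(theta1, theta2)
    theta2_next = inner_update_agent2(theta1, theta2)
    return expected_return(theta1_next, theta2_next, R1)

theta1, theta2 = 0.1, -0.2
meta_grad = jax.grad(meta_objective, argnums=0)(theta1, theta2)
theta1 = theta1 + 0.1 * meta_grad  # meta step for agent 1
print(meta_grad)
```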
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We introduce a novel meta-multiagent policy gradient theorem, grounded in meta-learning, that enables an agent to adapt quickly to non-stationarity in the policies of the other agents in the environment.
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=4lt958ulaL
