Analyzing the Sample Complexity of Model-Free Opponent Shaping

Published: 19 Jun 2023, Last Modified: 09 Jul 2023, Frontiers4LCD
Keywords: Opponent Shaping, Multi-Agent, Reinforcement Learning, Meta Reinforcement Learning, Sample Complexity, PAC-MDP
Abstract: In mixed-incentive multi-agent environments, methods developed for zero-sum games often yield collectively sub-optimal results. Addressing this, \textit{opponent shaping} (OS) strategies aim to actively guide the learning processes of other agents, empirically leading to improved individual and group performance. Early OS methods use higher-order derivatives to shape the learning of co-players, which prevents them from anticipating more than a few learning steps ahead. The follow-up work Model-Free Opponent Shaping (M-FOS) addresses these shortcomings by reframing the OS problem as a meta-game. In this meta-game, a meta-step corresponds to an episode of the ``inner'' game: the meta-state consists of the inner policies, and the meta-policy outputs an inner policy at each meta-step. Leveraging model-free optimization techniques, M-FOS learns meta-policies that exhibit long-horizon opponent shaping, e.g., by discovering a novel extortion strategy in the Iterated Prisoner's Dilemma (IPD). In contrast to early OS methods, however, there is little theoretical understanding of the M-FOS framework. In this work, we derive sample complexity bounds for M-FOS agents both theoretically and empirically. To quantify the sample complexity, we adapt the $R_{max}$ algorithm, most prominently used to derive sample bounds for MDPs, as the meta-learner in the M-FOS framework, and derive an exponential sample complexity bound. Our theoretical results are empirically supported in the Matching Pennies environment.
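The meta-game framing described in the abstract can be illustrated with a minimal sketch: each meta-step plays one episode of the inner game (here, Matching Pennies, as in the paper's experiments), the meta-state is the pair of inner policies, and the meta-policy maps that state to the shaping agent's next inner policy while the opponent adapts naively. All function names, the opponent's update rule, and the parameters below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def inner_episode(p1_heads, p2_heads, steps=100, rng=None):
    """One inner-game episode of Matching Pennies. Inner policies are
    Bernoulli probabilities of playing Heads; player 1 gets +1 on a
    match and -1 on a mismatch. Returns player 1's average payoff."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    total = 0
    for _ in range(steps):
        a1 = rng.random() < p1_heads
        a2 = rng.random() < p2_heads
        total += 1 if a1 == a2 else -1
    return total / steps

def meta_rollout(meta_policy, opponent_lr=0.1, meta_steps=20):
    """Sketch of the M-FOS-style meta-game loop: one meta-step = one
    inner episode. `meta_policy` is a hypothetical stand-in for the
    learned meta-policy; the opponent's naive update rule is assumed."""
    p1, p2 = 0.5, 0.5  # initial inner policies (the meta-state)
    returns = []
    for _ in range(meta_steps):
        p1 = meta_policy((p1, p2))   # meta-action: next inner policy
        returns.append(inner_episode(p1, p2))
        # Opponent nudges its Heads probability toward mismatching
        # player 1 (an assumed, simplistic learning rule).
        p2 += opponent_lr * ((1 - p1) - p2)
    return returns

# Example: a constant meta-policy that always plays Heads.
rets = meta_rollout(lambda state: 1.0)
```

A learned meta-policy would instead condition on the meta-state to steer the opponent's updates over many such meta-steps, which is exactly the long-horizon shaping behavior whose sample complexity the paper analyzes.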
Submission Number: 136