Sample Efficient Robust Offline Self-Play for Model-based Reinforcement Learning

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: robust Markov games, self-play, distribution shift, model uncertainty, reinforcement learning
TL;DR: We design the first algorithm to achieve the optimal upper bound under partial coverage and environment uncertainty in robust two-player zero-sum Markov games.
Abstract: Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties, policies in MARL must remain robust to tackle the sim-to-real gap. Although robust RL has been extensively explored in single-agent settings, it has received little attention in self-play, where strategic interactions heighten uncertainties. We focus on robust two-player zero-sum Markov games (TZMGs) in offline RL, specifically on tabular robust TZMGs (RTZMGs) with a given uncertainty set. To address sample scarcity, we introduce a model-based algorithm (*RTZ-VI-LCB*) for RTZMGs, which combines robust value iteration, accounting for the uncertainty level, with a data-driven penalty applied to the robust value estimates. We establish the finite-sample complexity of RTZ-VI-LCB by accounting for distribution shifts in the historical dataset. Our algorithm is capable of learning under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to show that learning RTZMGs is at least as difficult as learning standard TZMGs when the uncertainty level is sufficiently small. This confirms the tightness of our algorithm's sample complexity, which is optimal with respect to both the state and action spaces. To the best of our knowledge, our algorithm is the first to attain this optimality and establishes a new benchmark for offline RTZMGs. We also extend our algorithm to multi-agent general-sum Markov games, breaking the curse of multiagency.
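The abstract describes the core idea of RTZ-VI-LCB only at a high level (robust value iteration plus a data-driven lower-confidence-bound penalty). Below is a minimal illustrative Python sketch of a pessimistic robust Bellman backup on a tabular model, assuming a total-variation uncertainty set and a Hoeffding-style penalty; both are assumptions made for illustration, the per-state matrix-game (Nash) step is omitted, and the function names `worst_case_expectation` and `pessimistic_robust_q` are hypothetical rather than the authors' implementation.

```python
import numpy as np

def worst_case_expectation(p_hat, v, sigma):
    """Inner minimization  min_{q : TV(q, p_hat) <= sigma}  q . v,
    solved greedily for a total-variation ball: shift up to `sigma`
    probability mass from the highest-value states onto the lowest-value state."""
    q = p_hat.copy()
    lo = np.argmin(v)          # worst next state for the value function v
    budget = sigma
    for s in np.argsort(v)[::-1]:   # visit states from highest to lowest value
        if s == lo or budget <= 0:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[lo] += moved
        budget -= moved
    return q @ v

def pessimistic_robust_q(counts, p_hat, reward, v_next, sigma, c=1.0, delta=0.01):
    """One pessimistic robust backup on a tabular two-player model:
    robust expectation under a TV uncertainty set minus a count-based LCB penalty.
    Rewards are assumed nonnegative, so estimates are clipped at zero."""
    S, A, B = reward.shape
    q = np.zeros((S, A, B))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                n = max(counts[s, a, b], 1)
                penalty = c * np.sqrt(np.log(S * A * B / delta) / n)  # illustrative data-driven penalty
                robust_ev = worst_case_expectation(p_hat[s, a, b], v_next, sigma)
                q[s, a, b] = max(reward[s, a, b] + robust_ev - penalty, 0.0)
    return q
```

In a full value-iteration loop, the backed-up Q-values at each state would then be fed to a zero-sum matrix-game solver to obtain the equilibrium value and policies for the max and min players; that step is intentionally left out of this sketch.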
Primary Area: learning theory
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7040