TL;DR: A general algorithm that provably minimizes policy regret in complex games with strategic opponents.
Abstract: We study the policy-regret minimization problem in dynamically evolving environments, modeled as Markov games between a learner and a strategic, adaptive opponent. We propose a general algorithmic framework that achieves the optimal $\mathcal{O}(\sqrt{T})$ policy regret for a wide class of large-scale problems characterized by an Eluder-type condition, extending beyond the tabular settings of previous work. Importantly, our framework uncovers a simpler yet powerful algorithmic approach for handling reactive adversaries, demonstrating that leveraging the opponent's learning in such settings is key to attaining the optimal $\mathcal{O}(\sqrt{T})$ policy regret.
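For readers unfamiliar with the term, policy regret differs from external regret in that the comparator accounts for how an adaptive opponent would have reacted had the learner committed to a single policy from the start. A minimal sketch of the standard formulation from the online-learning literature (notation is illustrative and not taken from this submission) is:

% Standard notion of policy regret against an adaptive adversary
% (generic formulation; pi_1,...,pi_T are the policies actually played,
% Pi is the comparator class, and r_t depends on the full history of play
% because the opponent reacts to it).
\[
  \mathrm{PolicyRegret}(T)
    \;=\;
  \max_{\pi \in \Pi}
    \sum_{t=1}^{T} r_t\bigl(\pi, \dots, \pi\bigr)
    \;-\;
    \sum_{t=1}^{T} r_t\bigl(\pi_1, \dots, \pi_t\bigr).
\]
% The first sum is the counterfactual reward the opponent would have
% generated against the fixed policy pi played in every round, which is
% why this notion is strictly harder to control than external regret.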
Lay Summary: In many real-world situations—from autonomous vehicles to online recommendation systems—multiple decision-makers (or agents) must interact with one another while learning how to make better choices. But when these agents have competing goals, learning can become difficult and unpredictable. In this work, we design a learning algorithm that helps a decision-maker perform well even when facing an opponent that learns and adapts over time. Unlike past work, our approach works in complex settings where the environment is too large to fully memorize and where the opponent's behavior must be approximated. We show that our algorithm learns to make good decisions over time and provides the first theoretical guarantees for success in such challenging situations. Our results could help build safer and more reliable AI systems that learn through repeated interaction.
Primary Area: Theory->Learning Theory
Keywords: policy regret, Markov games, strategic opponents, function approximation, online learning, reinforcement learning, adversarial learning, Eluder dimension, multi-agent learning
Submission Number: 13580