Keywords: Partially Observable Markov Games, Policy Regret, Weakly-Revealing, Observable Operator Model
TL;DR: We give the first $\mathcal{O}(\sqrt{T})$ policy regret guarantee for partially observable Markov games against a strategic adaptive opponent.
Abstract: We study policy regret minimization in partially observable Markov games (POMGs) between a learner and a strategic adaptive opponent who adapts to the learner's past strategies. We develop a model-based optimistic framework that operates on the learner-observable process using a \emph{joint} MLE confidence set, and we introduce an Observable Operator Model-based causal decomposition that disentangles the coupling between the world model and the adversary model. Under multi-step weakly revealing observations and a bounded-memory, stationary, and posterior-Lipschitz opponent, we prove an $\mathcal{O}(\sqrt{T})$ policy regret bound. This work extends regret analysis from Markov games to POMGs and provides the first policy regret guarantee under imperfect information against an adaptive opponent.
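The abstract does not give the formal objective; as a rough sketch, one standard way to formalize policy regret against an opponent that adapts to the learner's past strategies is shown below (the symbols $\Pi$, $V$, and the adaptation maps $f_t$ are illustrative assumptions, not notation taken from the paper):
$$
\mathrm{PolicyReg}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} V\!\big(\pi,\; f_t(\pi,\dots,\pi)\big) \;-\; \sum_{t=1}^{T} V\!\big(\pi_t,\; f_t(\pi_1,\dots,\pi_{t-1})\big),
$$
where $\pi_1,\dots,\pi_T$ are the learner's played policies, $f_t$ maps the learner's first $t-1$ policies to the opponent's strategy at round $t$, and $V(\pi,\nu)$ is the learner's expected return in the POMG when the learner plays $\pi$ and the opponent plays $\nu$. The comparator term evaluates the opponent's counterfactual adaptation to the fixed policy $\pi$, which is what distinguishes policy regret from external regret.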
Primary Area: learning theory
Submission Number: 21484