Optimistic Value Iteration with Representation Learning for Low-Rank POMGs

Chenhao Zhou; Lei Qian; Chao Zhang; Hanbin Zhao; Hui Qian

15 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Multi-Agent Reinforcement Learning; Partially Observable Markov Game

Abstract: Partially Observable Markov Games (POMGs) pose significant challenges for multi-agent reinforcement learning because they combine partial observability with strategic interaction. Recent advances exploit the inherent structure of the POMG dynamics and develop efficient representation methods that enable planning in the latent space rather than operating directly on the history trajectory. In this paper, we focus on low-rank POMGs and propose a unified optimistic value iteration (OVI) framework that accommodates different low-rank representation learning methods. Given a representation, OVI constructs an optimistic bonus and integrates it into the value function to encourage exploration and mitigate the bias caused by the representation approximation error. When an exact value function oracle is unavailable, OVI instead uses the low-rank representation to construct optimistic and pessimistic estimators of the value functions via the Bellman recursion, and selects the final solution based on the optimistic-pessimistic gap. Our theoretical analysis shows that, once the representation approximation error is bounded, OVI converges to an approximate equilibrium. We instantiate the framework with two provable representation learning methods: an MLE-based approach and a spectral decomposition representation method. Furthermore, we develop a novel representation method, $L$-step Latent Variable Representation (LLVR), for POMGs with infinite-dimensional latent spaces, i.e., infinite rank, and prove that OVI with LLVR also achieves approximate equilibria under an additional $L$-decodability assumption. Collectively, these results establish the first systematic representation learning perspective for POMGs.
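
The optimistic/pessimistic recursion and gap-based selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a fully observed, tabular, single-policy setting with rewards in [0, 1], and the names `value_bounds` and `select_by_gap`, the per-state `bonus`, and the clipping rule are all hypothetical choices made for illustration only.

```python
def value_bounds(reward, trans, bonus, horizon):
    """Backward Bellman recursion giving upper/lower value estimates.

    Assumptions (illustrative, not from the paper):
      reward[s] in [0, 1], bonus[s] >= 0, and trans[s][t] is an estimated
      probability of moving from state s to state t under a fixed policy.
    """
    n = len(reward)
    upper = [0.0] * n  # optimistic values after the final step
    lower = [0.0] * n  # pessimistic values after the final step
    for h in range(horizon, 0, -1):
        remaining = horizon - h + 1  # largest return achievable from step h
        new_upper, new_lower = [], []
        for s in range(n):
            exp_up = sum(trans[s][t] * upper[t] for t in range(n))
            exp_lo = sum(trans[s][t] * lower[t] for t in range(n))
            # Add the bonus for optimism, subtract it for pessimism,
            # and clip both estimates to the feasible value range.
            new_upper.append(min(reward[s] + bonus[s] + exp_up, remaining))
            new_lower.append(max(reward[s] - bonus[s] + exp_lo, 0.0))
        upper, lower = new_upper, new_lower
    return upper, lower


def select_by_gap(candidates, start_state=0):
    """Pick the candidate whose optimistic-pessimistic gap is smallest.

    Each candidate is a (reward, trans, bonus, horizon) tuple.
    """
    gaps = []
    for reward, trans, bonus, horizon in candidates:
        upper, lower = value_bounds(reward, trans, bonus, horizon)
        gaps.append(upper[start_state] - lower[start_state])
    best = min(range(len(gaps)), key=gaps.__getitem__)
    return best, gaps


if __name__ == "__main__":
    trans = [[1.0, 0.0], [0.0, 1.0]]  # two absorbing states
    reward = [1.0, 0.0]
    candidates = [
        (reward, trans, [0.1, 0.1], 2),  # well-explored: small bonus
        (reward, trans, [0.5, 0.5], 2),  # poorly explored: large bonus
    ]
    best, gaps = select_by_gap(candidates)
    print(best, gaps)
```

In this toy setting a smaller bonus stands in for a smaller representation or estimation error, so the candidate with the narrower gap is the one whose value is pinned down more tightly.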

Primary Area: reinforcement learning

Submission Number: 5907
