Provable Partially Observable Reinforcement Learning with Privileged Information

Published: 19 Jun 2024, Last Modified: 26 Jul 2024
Venue: ARLET 2024 Poster
License: CC BY 4.0
Keywords: reinforcement learning, pomdp, partial observability, computational, privileged information, expert distillation, teacher-student learning
TL;DR: We study the provable efficiency of partially observable RL with privileged information about the latent state during training, in both single-agent and multi-agent settings.
Abstract: Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., access to the underlying states from simulators, has been exploited during training and has achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple, practically used paradigms in this setting, with analyses of both computational and sample efficiency. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition on the partially observable environment, the deterministic filter condition, under which expert distillation achieves polynomial sample and computational complexities. Furthermore, we investigate another successful empirical paradigm, asymmetric actor-critic, focusing on the more challenging setting of observable partially observable Markov decision processes (POMDPs). We develop a belief-weighted optimistic asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, where one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms featuring centralized training with decentralized execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexities in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without relying on computationally intractable oracles.
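The first paradigm the abstract refers to, expert distillation (teacher-student learning), can be illustrated with a minimal sketch: a teacher with privileged access to the latent state produces action labels, and a history-dependent student is fit to imitate them from observations alone. The sketch below is only an illustration of this generic paradigm under simplifying assumptions (a small random tabular POMDP, a greedy one-step teacher, and a count-based student); it is not the paper's algorithm, and names such as `TinyPOMDP`-style tables, `teacher`, and `student_counts` are hypothetical.

```python
# Minimal sketch (assumptions): a tiny random tabular POMDP with privileged access
# to the latent state during training. A "teacher" acts on the latent state; a
# "student" is distilled to imitate it from observation histories only.
# All names here are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
S, A, O, H = 3, 2, 3, 4          # latent states, actions, observations, horizon

# Random POMDP: T[s, a] = next-state distribution, Z[s] = observation distribution
T = rng.dirichlet(np.ones(S), size=(S, A))
Z = rng.dirichlet(np.ones(O), size=S)
R = rng.random((S, A))           # reward depends on the latent state

def teacher(s):
    """Privileged teacher: greedy action given the latent state (a simplification)."""
    return int(np.argmax(R[s]))

# Student: a table mapping observation histories to counts of teacher actions.
student_counts = {}

def student(history):
    """Student policy: most frequent teacher action for this history, random if unseen."""
    counts = student_counts.get(history)
    return int(np.argmax(counts)) if counts is not None else int(rng.integers(A))

# --- Distillation: roll out the teacher, label each history with its action ---
for episode in range(2000):
    s = int(rng.integers(S))
    history = ()
    for h in range(H):
        o = int(rng.choice(O, p=Z[s]))
        history = history + (o,)
        a = teacher(s)                                   # privileged label
        student_counts.setdefault(history, np.zeros(A))[a] += 1
        s = int(rng.choice(S, p=T[s, a]))

# --- Evaluation: the student acts from observations alone ---
def evaluate(policy_from_history, episodes=500):
    total = 0.0
    for _ in range(episodes):
        s = int(rng.integers(S))
        history = ()
        for h in range(H):
            o = int(rng.choice(O, p=Z[s]))
            history = history + (o,)
            a = policy_from_history(history)
            total += R[s, a]
            s = int(rng.choice(S, p=T[s, a]))
    return total / episodes

print("distilled student return:", evaluate(student))
```

In a sketch like this, the student can only match the privileged teacher when observation histories pin down the latent state well enough, which is the kind of structural requirement captured by the deterministic filter condition discussed in the abstract.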
Submission Number: 80