Abstract: Partially observable Markov decision processes (POMDPs) have been widely applied in real-world settings. However, existing results show that learning in POMDPs is intractable in the worst case, where the main challenge lies in the lack of latent state information. For example, in wireless channel scheduling, energy and security constraints usually make it difficult or impossible for the user to know the conditions (states) of all channels. A fundamental question is therefore: how much online state information (OSI) is sufficient to achieve tractability? In this paper, we make the first effort to establish fundamental conditions and methods for bridging the gap between partially observable reinforcement learning and networking with incomplete state information. Specifically, we establish a lower bound that reveals a surprising hardness result: unless full OSI is available, obtaining an ϵ-optimal policy for a POMDP requires a sample complexity that scales exponentially. Nonetheless, motivated by the structure of practical systems, we identify important subclasses of POMDPs that are tractable even with only partial OSI. For two such subclasses, we provide new algorithms that are proved to be near-optimal by establishing new regret upper and lower bounds.