Keywords: Infinite-Horizon, Average-Reward POMDPs, Spectral Learning, Regret Minimization
Abstract: We address the learning problem in the context of infinite-horizon average-reward POMDPs. Traditionally, this problem has been approached using $\textit{Spectral Decomposition}$ (SD) methods applied to samples collected under non-adaptive policies, such as uniform or round-robin policies. Recently, SD techniques have been extended to accommodate a restricted class of adaptive policies such as $\textit{memoryless policies}$. However, the use of adaptive policies has introduced challenges related to data inefficiency, as SD methods typically require all samples to be drawn from a single policy.
In this work, we propose $\texttt{Mixed Spectral Estimation}$, which generalizes spectral estimation techniques to support a broader class of $\textit{belief-based policies}$. We resolve the open question of whether spectral methods can be applied to samples collected from multiple policies, and we provide finite-sample guarantees for our approach under standard observability and ergodicity assumptions.
Building on this data-efficient estimation method, we introduce the $\texttt{Mixed Spectral UCRL}$ algorithm. Through a refined theoretical analysis, we demonstrate that it achieves a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$ with respect to the optimal policy, without requiring full knowledge of either the transition or the observation model. Finally, we present numerical simulations that validate the theoretical analysis of both the proposed estimation procedure and the $\texttt{Mixed Spectral UCRL}$ algorithm.
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 15437