Pessimistic Model-Based Actor-Critic for Offline Reinforcement Learning: Theory and Algorithms

Keywords: Actor-critic, Model-based offline RL, PAC guarantee, Pessimism
Abstract: Model-based offline reinforcement learning (RL) has outperformed model-free RL in many decision-making problems thanks to its sample efficiency and generalizability. However, prior model-based offline RL methods either demonstrate their success only through empirical studies or provide algorithms that have theoretical guarantees but are hard to implement in practice. To date, a general, computationally tractable algorithm for model-based offline RL with PAC guarantees is still lacking. To fill this gap, we develop a pessimistic model-based actor-critic (PeMACO) algorithm with general function approximation, assuming partial coverage of the offline dataset. Specifically, the critic provides a pessimistic Q-function by incorporating the uncertainty of the learned transition model, and the actor updates the policy using approximations of the pessimistic Q-function. Under mild assumptions, we establish PAC guarantees for PeMACO by proving upper bounds on the suboptimality of the policy it returns.
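
To make the critic/actor split concrete, the following is a minimal tabular sketch and not the PeMACO algorithm itself. It assumes that model uncertainty is represented by a small finite set of candidate transition models consistent with the data, that the critic evaluates the current policy exactly under the candidate with the lowest value, and that the actor takes a softmax (multiplicative-weights) step. All names (`policy_q`, `pessimistic_q`, `run`, `model_a`, `model_b`) and the toy MDP are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def policy_q(P, R, pi, gamma):
    """Exact Q^pi under transition model P of shape (S, A, S) and reward R of shape (S, A)."""
    S = R.shape[0]
    # State-to-state kernel under pi: P_pi[s, t] = sum_a pi[s, a] * P[s, a, t]
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * (P @ V)  # shape (S, A)

def pessimistic_q(models, R, pi, gamma, mu0):
    """Critic (assumed form): Q^pi of the candidate model with the lowest value from mu0."""
    qs = [policy_q(P, R, pi, gamma) for P in models]
    values = [mu0 @ (pi * q).sum(axis=1) for q in qs]
    return qs[int(np.argmin(values))]

def run(models, R, gamma=0.9, eta=0.1, iters=50):
    """Actor (assumed form): accumulate eta * pessimistic Q in the policy logits."""
    S, A = R.shape
    mu0 = np.full(S, 1.0 / S)   # uniform initial-state distribution (assumption)
    logits = np.zeros((S, A))
    for _ in range(iters):
        pi = softmax(logits)
        logits = logits + eta * pessimistic_q(models, R, pi, gamma, mu0)
    return softmax(logits)

# Toy MDP: state 0 is the decision state; states 1 (reward 1) and 2 (reward 0) are absorbing.
R = np.array([[0.5, 0.0], [1.0, 1.0], [0.0, 0.0]])
base = np.zeros((3, 2, 3))
base[0, 0, 0] = 1.0   # action 0 in state 0: well covered, stays in state 0
base[1, :, 1] = 1.0
base[2, :, 2] = 1.0
model_a = base.copy(); model_a[0, 1, 1] = 1.0   # action 1 leads to the good state
model_b = base.copy(); model_b[0, 1, 2] = 1.0   # action 1 leads to the bad state

pessimistic_pi = run([model_a, model_b], R)   # both candidates are plausible
optimistic_pi = run([model_a], R)             # trusts model_a alone
```

In this toy problem the two candidates disagree only about the poorly covered action, so the pessimistic policy concentrates on the well-covered action in state 0, while a policy trained against `model_a` alone prefers the uncertain one.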
