Keywords: Actor-critic, Model-based offline RL, PAC guarantee, Pessimism
Abstract: Model-based offline reinforcement learning (RL) has achieved superior performance over model-free RL in many decision-making problems due to its sample efficiency and generalizability. However, prior model-based offline RL methods in the literature either demonstrate their successes only through empirical studies, or provide algorithms that enjoy theoretical guarantees but are hard to implement in practice. To date, a general, computationally tractable algorithm for model-based offline RL with PAC guarantees is still lacking. To fill this gap, we develop a pessimistic model-based actor-critic (PeMACO) algorithm with general function approximations under a partial-coverage assumption on the offline dataset. Specifically, the critic provides a pessimistic Q-function by incorporating uncertainties of the learned transition model, and the actor updates policies by employing approximations of the pessimistic Q-function. Under some mild assumptions, we establish theoretical PAC guarantees for the proposed PeMACO algorithm by proving upper bounds on the suboptimality of the policy returned by PeMACO.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)