Abstract: Model-based offline reinforcement learning (RL) methods have achieved state-of-the-art performance in many decision-making problems thanks to their sample efficiency and generalizability. Despite these advancements, existing model-based offline RL approaches either focus on theoretical analysis without developing practical algorithms or rely on a restricted parametric policy class, and thus do not fully leverage the unrestricted policy space inherent to model-based methods. To address this limitation, we develop MoMA, a model-based mirror ascent algorithm with general function approximations under partial coverage of offline data. MoMA distinguishes itself from the existing literature by employing an unrestricted policy class. In each iteration, MoMA conservatively estimates the value function by minimizing over a confidence set of transition models in the policy evaluation step, and then updates the policy with general function approximations, rather than the commonly used parametric policy classes, in the policy improvement step. Under mild assumptions, we establish theoretical guarantees for MoMA by proving an upper bound on the suboptimality of the returned policy.
We also provide a practically implementable, approximate version of the algorithm. The effectiveness of MoMA is demonstrated via numerical studies.
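For intuition, the snippet below is a minimal, hypothetical sketch of the pessimistic-evaluation / mirror-ascent structure described in the abstract, written for a tabular MDP with a finite list of candidate transition models standing in for the confidence set. The function names (`evaluate_policy`, `moma_sketch`) and the finite-model simplification are illustrative assumptions for exposition, not the paper's actual implementation or the code in the linked repository.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.99):
    """Exact policy evaluation in a tabular MDP.
    P[s, a, s'] is a transition tensor, R[s, a] a reward matrix,
    pi[s, a] a (possibly unrestricted) tabular policy."""
    S, A = R.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)      # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", R, pi)        # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    return V, Q

def moma_sketch(candidate_models, R, rho, n_iters=50, eta=1.0, gamma=0.99):
    """Illustrative pessimistic mirror-ascent loop (assumed names, not the authors' code).
    candidate_models: list of transition tensors standing in for the confidence set.
    rho: initial state distribution used to score pessimism."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)              # start from the uniform policy
    for _ in range(n_iters):
        # Policy evaluation: pick the most pessimistic model in the confidence set.
        evals = [evaluate_policy(P, R, pi, gamma) for P in candidate_models]
        V_pess, Q_pess = min(evals, key=lambda vq: rho @ vq[0])
        # Policy improvement: KL mirror ascent, i.e. multiplicative weights on Q.
        logits = np.log(pi + 1e-12) + eta * Q_pess
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```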
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We would like to thank the Action Editor and the reviewers for their constructive comments. We have addressed the reviewers’ concerns, which we believe have greatly improved our paper. Below is a summary of the major changes:
1. Enhanced Clarity. We have improved the explanations of theoretical concepts and provided intuitive insights into the algorithms. Specifically, we moved the discussion comparing MoMA to existing offline RL work into the main text. In Section 3, we expanded on the motivation behind our algorithm's design to provide a clearer rationale. We also added more intuition and definitions for symbols and notation, restructured the paper to avoid excessive forward references, and included examples alongside key concepts to aid understanding. These changes make the paper more cohesive and easier to follow.
2. Strengthened Empirical Validation. We have included additional experiments in numerical studies, detailed in the Appendix, to empirically validate the benefits of the unrestricted policy class, highlighting scenarios where parametric policies fall short.
3. Balanced Discussion. We have provided a more balanced discussion of the results at the end of the paper, acknowledging instances where MoMA does not outperform existing methods and offering potential explanations for these outcomes.
Code: https://github.com/YanxunXu/MOMA
Assigned Action Editor: ~Shixiang_Gu1
Submission Number: 2076