Effective Offline Environment Reconstruction when the Dataset is Collected from Diversified Behavior Policies

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: environment model learning, offline reinforcement learning, off-policy evaluation, offline policy selection, model predictive control
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a new environment model learning technique with better generalization ability that utilizes the implicit multi-source nature of the offline dataset.
Abstract: In reinforcement learning, an accurate environment dynamics model is crucial for evaluating different policies' values in tasks like offline policy optimization and policy evaluation. However, the learned model is known to have large value gaps when evaluating target policies that differ from the data-collection policies. This issue has hindered the wide adoption of models, since various policies need to be evaluated in these downstream tasks. In this paper, we focus on a typical offline environment model learning scenario in which the offline dataset is collected from diversified policies. We utilize the implicit multi-source nature of this scenario and propose an easy-to-implement yet effective algorithm, policy-conditioned model (PCM) learning, for accurate model learning. PCM is a meta-dynamics model trained to be aware of the evaluation policy and to adjust itself on the fly to match the evaluation policy's state-action distribution, thereby improving prediction accuracy. We give a theoretical analysis and experimental evidence to demonstrate the feasibility of reducing value gaps by adapting the dynamics model to different policies. Experimental results show that PCM outperforms existing SOTA off-policy evaluation methods on the DOPE benchmark by \textit{a large margin}, and derives significantly better policies in offline policy selection and model predictive control compared with the standard model learning method.
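
To make the policy-conditioned idea concrete, below is a minimal sketch (not the authors' implementation) of what a meta-dynamics model conditioned on the evaluation policy might look like: the policy is summarized into an embedding from a window of its recent state-action pairs, and the transition predictor is conditioned on that embedding so its predictions can adapt to the evaluation policy's state-action distribution. All class, argument, and dimension names here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PolicyConditionedDynamicsModel(nn.Module):
    """Illustrative sketch of a policy-conditioned (meta-)dynamics model.

    A short history of (state, action) pairs generated by the policy being
    evaluated is encoded into a policy embedding; the next-state predictor is
    conditioned on this embedding in addition to the current (state, action).
    Sizes and architecture choices are placeholders, not the paper's setup.
    """

    def __init__(self, state_dim, action_dim, embed_dim=32, hidden_dim=256):
        super().__init__()
        # Encodes recent interaction history into a policy embedding.
        self.policy_encoder = nn.GRU(
            input_size=state_dim + action_dim,
            hidden_size=embed_dim,
            batch_first=True,
        )
        # Predicts the next state conditioned on (s, a) and the policy embedding.
        self.dynamics_head = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, history, state, action):
        # history: (batch, T, state_dim + action_dim) recent transitions from
        # the evaluation policy; state: (batch, state_dim); action: (batch, action_dim).
        _, h_n = self.policy_encoder(history)
        policy_embedding = h_n.squeeze(0)  # (batch, embed_dim)
        x = torch.cat([state, action, policy_embedding], dim=-1)
        return self.dynamics_head(x)  # predicted next state (or state delta)
```

Training such a model on a dataset collected from diversified behavior policies would pair each transition with a history drawn from the same behavior policy, so that at evaluation time the embedding computed from the target policy's rollouts steers the dynamics prediction toward that policy's distribution.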
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7281