Keywords: offline reinforcement learning
Abstract: Offline reinforcement learning suffers from extrapolation error in the Q-value function. In addition, most methods enforce a uniform constraint on the policy throughout training, regardless of how far out-of-distribution its actions are. We propose pessimistic policy iteration, which guarantees that the Q-value evaluation error is small under the trained policy's distribution and bounds the suboptimality gap of the trained policy's value function. The core component of pessimistic policy iteration is a horizon-flexible uncertainty quantifier, which adapts the constraint to regional uncertainty. Our empirical study shows that the proposed method boosts the performance of baseline methods and is robust to the scale of the constraint, and that a flexible uncertainty horizon is necessary to identify out-of-distribution regions.
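The pessimistic evaluation idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the ensemble-standard-deviation uncertainty quantifier, the penalty coefficient `beta`, and all array shapes are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble of Q-value estimates for a batch of next
# state-action pairs, shape (num_ensemble, batch_size). In practice these
# would come from independently trained Q-networks.
q_ensemble = rng.normal(loc=1.0, scale=0.3, size=(5, 4))

rewards = np.ones(4)  # placeholder one-step rewards for the batch
gamma = 0.99          # discount factor
beta = 1.0            # pessimism coefficient (illustrative value)

# Regional uncertainty: ensemble standard deviation per state-action pair.
# High disagreement signals an out-of-distribution region.
uncertainty = q_ensemble.std(axis=0)

# Pessimistic Bellman target: penalize the mean Q-estimate by the local
# uncertainty, so OOD regions receive a stronger (more pessimistic) constraint.
q_target = rewards + gamma * (q_ensemble.mean(axis=0) - beta * uncertainty)
```

The penalty scales with local disagreement, so the constraint is stronger exactly where the batch data provides less coverage, matching the abstract's point that a fixed constraint ignores the policy's out-of-distribution level.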
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (eg, decision and control, planning, hierarchical RL, robotics)