Learning Constraints from Offline Dataset via Inverse Dual Values Estimation

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Inverse Constrained Reinforcement Learning, Offline Reinforcement Learning, Dual Reinforcement Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: In this paper, we propose a solution for offline Inverse Constrained Reinforcement Learning (ICRL) by deriving dual value functions from regularized policy learning.
Abstract: To develop safe control strategies, Inverse Constrained Reinforcement Learning (ICRL) infers constraints from expert demonstrations and trains policy models under these constraints. Classical ICRL algorithms typically adopt an online learning paradigm that permits boundless exploration in an interactive environment. However, in realistic applications, iteratively collecting experiences from the environment is dangerous and expensive, especially for safety-critical control tasks. To address this challenge, in this work, we present a novel Inverse Dual Values Estimation (IDVE) framework. To enable offline ICRL, IDVE dynamically integrates the conservative estimation inherent in offline RL and the data-driven inference in inverse RL, thereby effectively learning constraints from limited data. Specifically, IDVE derives the dual value functions for both rewards and costs, estimating them in a bi-level optimization problem based on the offline dataset. To derive a practical IDVE algorithm for offline constraint inference, we introduce methods for 1) tackling unknown transitions, 2) scaling to continuous environments, and 3) controlling the degree of constraint regularization. With these advancements, empirical studies demonstrate that IDVE outperforms other baselines in terms of accurately recovering the constraints and adapting to high-dimensional environments with diverse reward configurations.
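Since the paper's derivation is not included on this page, the following is only a minimal sketch of the idea the abstract describes: estimating a pair of dual value functions (one for rewards, one for costs) from an offline batch via a regularized Bellman-residual objective. The network architecture, the loss `dual_value_loss`, and all hyperparameter names are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Small MLP used as a state value function approximator (illustrative)."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

def dual_value_loss(v_r: ValueNet, v_c: ValueNet, batch: dict,
                    gamma: float = 0.99, alpha: float = 1.0, lam: float = 1.0):
    """One possible regularized dual objective on an offline batch (assumption,
    not the paper's exact loss).

    batch holds tensors 's', 's_next', 'r', 'done' drawn from the offline dataset.
    The Bellman residual of the combined value (reward minus lam * cost) is
    penalized with a convex term, mimicking the conservative estimation that
    dual formulations of offline RL typically impose.
    """
    s, s_next = batch["s"], batch["s_next"]
    r, done = batch["r"], batch["done"]
    v = v_r(s) - lam * v_c(s)                 # combined dual value at s
    v_next = v_r(s_next) - lam * v_c(s_next)  # combined dual value at s'
    residual = r + gamma * (1.0 - done) * v_next - v
    # Initial-value term plus a squared penalty on positive residuals.
    return v.mean() + alpha * (torch.relu(residual) ** 2).mean()

# Toy usage on synthetic 4-dimensional states (all data is made up).
if __name__ == "__main__":
    v_r, v_c = ValueNet(4), ValueNet(4)
    opt = torch.optim.Adam(list(v_r.parameters()) + list(v_c.parameters()), lr=3e-4)
    batch = {"s": torch.randn(32, 4), "s_next": torch.randn(32, 4),
             "r": torch.randn(32), "done": torch.zeros(32)}
    loss = dual_value_loss(v_r, v_c, batch)
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's bi-level setting, an outer loop would additionally adjust the inferred cost so that expert demonstrations remain feasible while other offline behavior is penalized; that outer step is omitted here because its exact form is not given on this page.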
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4933