Abstract: In this work, we decouple the iterative (bi-level) offline RL optimization from the offline training phase, forming a non-iterative bi-level learning paradigm that avoids iterative error propagation across the two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization during training (i.e., employing policy/value regularization) while performing outer-level optimization during testing (i.e., conducting policy inference). Naturally, such a paradigm raises three core questions that prior non-iterative methods do not fully answer: (Q1) What information should we transfer from the inner level to the outer level? (Q2) What should we pay attention to when using the transferred information in outer-level optimization? (Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization, we propose DROP, which answers all three questions. In particular, at the inner level, DROP decomposes the offline data into multiple subsets and learns a score model (Q1). To exploit the score model safely at the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (Q2). During testing, we show that DROP permits deployment adaptation, enabling adaptive inference across states (Q3). Empirically, we evaluate DROP on various benchmarks and show that it achieves performance comparable to or better than prior offline RL methods.
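To make the outer-level (test-time) step described in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical differentiable score model that is a concave quadratic in the embedding `z`, a hypothetical list of learned behavior embeddings, and a simple proximity penalty that stands in for keeping the search close to embeddings seen during training; the abstract does not specify the form of DROP's score model or of its conservative regularization. All names (`score`, `score_grad`, `infer_embedding`, `w`, `penalty`) are illustrative assumptions.

```python
import numpy as np

def score(state, z, w):
    # Hypothetical score model: a concave quadratic in z whose peak (w @ state)
    # depends on the current state. A learned network would replace this.
    return -np.sum((z - w @ state) ** 2)

def score_grad(state, z, w):
    # Analytic gradient of the toy score with respect to z.
    return -2.0 * (z - w @ state)

def infer_embedding(state, behavior_embeddings, w, steps=50, lr=0.1, penalty=1.0):
    # Start from the behavior embedding that scores highest for this state.
    scores = [score(state, z, w) for z in behavior_embeddings]
    anchor = behavior_embeddings[int(np.argmax(scores))]
    z = anchor.copy()
    for _ in range(steps):
        # Gradient ascent on: score(state, z) - penalty * ||z - anchor||^2.
        # The penalty term is an assumed stand-in for conservative exploitation.
        grad = score_grad(state, z, w) - 2.0 * penalty * (z - anchor)
        z = z + lr * grad
    return z, anchor

w = np.eye(2)                                  # hypothetical score-model parameters
state = np.array([1.0, 0.0])                   # current test-time state
embeddings = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
z_star, anchor = infer_embedding(state, embeddings, w)
```

Because the search is rerun for every state, the inferred embedding can change from state to state, which is the kind of per-state adaptive inference the abstract refers to in (Q3). In this toy setting the penalty pulls the result toward the starting behavior embedding, trading some score for staying close to embeddings covered by the data.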
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)
Supplementary Material: zip