Offline Imitation Learning without Auxiliary High-quality Behavior Data

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: imitation learning, offline imitation learning, offline reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In this work, we study the problem of Offline Imitation Learning (OIL), where an agent aims to learn from demonstrations composed of expert and sub-optimal behaviors, without additional online environment interactions. Previous studies typically assume that high-quality behavior data is mixed into the auxiliary offline data, and their performance degrades seriously when only low-quality data from an off-policy distribution is available. In this work, we break through the bottleneck of OIL's reliance on auxiliary high-quality behavior data and make the first attempt to demonstrate that low-quality data is also helpful for OIL. Specifically, we utilize the transition information in the offline data to maximize the probability that the policy transitions towards expert-observed states. When reward signals are unavailable, this guidance can improve long-term returns on states not observed by the expert, ultimately enabling imitation learning to benefit from low-quality data. We instantiate our proposition in a simple but effective algorithm, Behavioral Cloning with Dynamic Programming (BCDP), which performs behavioral cloning on the expert data and dynamic programming on the unlabeled offline data. In experiments on benchmark tasks, unlike most existing offline imitation learning methods, which fail to utilize low-quality data effectively, our BCDP algorithm achieves an average performance gain of more than 40\% even when the offline data consists purely of random exploration.
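The abstract only outlines BCDP at a high level; the paper's actual objective and implementation surely differ. As a rough illustration of the stated idea, here is a hypothetical tabular sketch: behavioral cloning on the expert pairs, plus dynamic programming on the unlabeled transitions with a surrogate reward of 1 for reaching any expert-observed state. All function and variable names here are illustrative assumptions, not the authors' code.

```python
import numpy as np

def bcdp_tabular(expert_data, offline_data, n_states, n_actions,
                 gamma=0.95, n_iters=200):
    """Hypothetical tabular sketch of the BCDP idea.

    expert_data:  list of (state, action) pairs from the expert.
    offline_data: list of (state, action, next_state) transitions,
                  with no reward labels (possibly low-quality).
    Returns a greedy deterministic policy (one action per state).
    """
    # Surrogate reward: reaching any expert-observed state pays 1.
    expert_states = {s for s, _ in expert_data}

    # Dynamic programming (Q-iteration) over the unlabeled transitions,
    # steering the value function toward expert-observed states.
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        for s, a, s_next in offline_data:
            r = 1.0 if s_next in expert_states else 0.0
            Q[s, a] = r + gamma * Q[s_next].max()

    # Behavioral cloning: expert action frequencies per state.
    bc_counts = np.zeros((n_states, n_actions))
    for s, a in expert_data:
        bc_counts[s, a] += 1.0

    # Combine: imitate the expert where it was observed; elsewhere,
    # follow the DP values that lead back to expert-observed states.
    policy = np.empty(n_states, dtype=int)
    for s in range(n_states):
        if bc_counts[s].sum() > 0:
            policy[s] = bc_counts[s].argmax()
        else:
            policy[s] = Q[s].argmax()
    return policy
```

On a small chain MDP where the expert is only seen at the goal state, this sketch recovers "move toward the goal" on the unobserved states, which is the benefit the abstract attributes to using low-quality data.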
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7676