Trajectory-level Data Generation with Better Alignment for Offline Imitation Learning

ICLR 2025 Conference Submission 10355 Authors

27 Sept 2024 (modified: 28 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: offline imitation learning, behavior cloning, trajectory-level data generation, high-alignment trajectories
Abstract: Offline reinforcement learning (RL) relies heavily on densely precise reward signals, which are labor-intensive and challenging to obtain in many real-world scenarios. To tackle this challenge, offline imitation learning (IL) extracts optimal policies from expert demonstrations and datasets without reward labels. However, the scarcity of expert data and the abundance of suboptimal trajectories within the dataset impede the application of supervised learning methods like behavior cloning (BC). While previous research has focused on learning importance weights for BC or reward functions to integrate with offline RL algorithms, these approaches often result in suboptimal policy performance due to training instabilities and inaccuracies in learned weights or rewards. To address this problem, we introduce Trajectory-level Data Generation with Better Alignment (TDGBA), an algorithm that leverages alignment measures between unlabeled trajectories and expert demonstrations to guide a diffusion model in generating highly aligned trajectories. The aforementioned trajectories allow for the direct application of BC in order to extract optimal policies, negating the necessity for weight or reward learning. In particular, we define implicit expert preferences and, without the necessity of an additional human preference dataset, effectively utilise expert demonstrations to identify the preferences of unlabeled trajectories. Experimental results on the D4RL benchmarks demonstrate that TDGBA significantly outperforms state-of-the-art BC-based IL methods. Furthermore, we illustrate the efficacy of implicit expert preferences, which represents the inaugural application of the benefits of preference learning to offline IL.
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10355