POIL: Preference Optimization for Imitation Learning

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Offline Imitation Learning, Preference-based Reinforcement Learning, Large Language Model Alignment, Data Efficiency
TL;DR: POIL is a novel offline imitation learning method inspired by preference optimization in language models. By eliminating adversarial training and reference models, it delivers superior or competitive performance against SOTA methods on MuJoCo tasks.
Abstract: Imitation learning (IL) enables agents to learn policies by mimicking expert demonstrations. While online IL methods require environment interaction, which can be costly, risky, or impractical, offline IL allows agents to learn solely from expert datasets without any interaction with the environment. In this paper, we propose Preference Optimization for Imitation Learning (POIL), a novel approach inspired by preference optimization techniques in large language model alignment. POIL eliminates the need for adversarial training and reference models by directly comparing the agent's actions to expert actions using a preference-based loss function. We evaluate POIL on MuJoCo control tasks under two challenging settings: learning from a single expert demonstration and training with different dataset sizes (100\%, 10\%, 5\%, and 2\%) from the D4RL benchmark. Our experiments show that POIL consistently delivers superior or competitive performance compared with prior state-of-the-art methods, including Behavioral Cloning (BC), IQ-Learn, DMIL, and O-DICE, especially in data-scarce scenarios such as using one expert trajectory or as little as 2\% of the full expert dataset. These results demonstrate that POIL enhances data efficiency and stability in offline imitation learning, making it a promising solution for applications where environment interaction is infeasible and expert data is limited.
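
Illustrative sketch (not the authors' implementation): one way a reference-free, DPO-style preference loss could compare expert actions against the agent's own actions, as the abstract describes. The policy interface, the function name poil_style_loss, the beta temperature, and the choice of the agent's sampled action as the "dispreferred" example are all assumptions made for illustration.

import torch
import torch.nn.functional as F

def poil_style_loss(policy, states, expert_actions, beta=1.0):
    # Assumption: `policy(states)` returns a torch.distributions.Distribution
    # over actions with per-dimension log-probs (e.g. a diagonal Gaussian).
    dist = policy(states)
    agent_actions = dist.sample()                         # agent's own actions serve as the "dispreferred" sample
    logp_expert = dist.log_prob(expert_actions).sum(-1)   # log-prob of expert ("preferred") actions
    logp_agent = dist.log_prob(agent_actions).sum(-1)     # log-prob of the agent's sampled actions
    # Bradley-Terry / DPO-style objective with no reference policy:
    # push expert actions above the agent's current behaviour.
    margin = beta * (logp_expert - logp_agent)
    return -F.logsigmoid(margin).mean()

This sketch only illustrates the preference-based comparison of expert and agent actions without adversarial training or a reference model; the paper's actual loss may differ.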
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6549