Can Agents Learn Safe Behavior From Non-Preferred Demonstrations?

Published: 08 May 2026, Last Modified: 08 May 2026
Venue: ICRA 2026 Workshop RL4IL Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Inverse RL, Flow Matching, Preference Learning, Imitation Learning, Offline RL
Abstract: Safe Reinforcement Learning (RL) applications, such as autonomous vehicles and robotic manipulation, require policies that avoid constraint violations while achieving task objectives. A key challenge is data scarcity: large volumes of unlabeled operational data and identifiable failure cases are readily accessible, but curated demonstrations that are both high-quality and verifiably safe are scarce and prohibitively expensive to collect. This imbalance makes it essential to extract safe behaviors from heterogeneous datasets rather than relying exclusively on expert data. Existing preference-based methods suffer from errors that cascade across their multiple training stages, while safe imitation learning methods such as SafeDICE require computationally expensive procedures to separate safe from unsafe behaviors. We propose Negative-Observation Preference Extraction (NOPE), a single-phase algorithm that integrates implicit preference learning with a continuous flow matching policy. NOPE uses the Inverse Bellman Operator to extract implicit reward signals from preferences and weights the conditional flow matching objective with the resulting action-value estimates (Q-values). This incorporates safety constraints directly into the vector field, without an explicit reward model and without the computational overhead of gradient-based guidance during inference. Experiments on navigation and velocity-constraint tasks from the DSRL benchmark show that NOPE satisfies the safety constraints while achieving high returns. Dataset ablations confirm robustness to limited negative data and to dataset heterogeneity. On average, NOPE achieves high-reward trajectories while being 1.64 times safer than the baselines.
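The two ingredients named in the abstract, an implicit reward obtained through the Inverse Bellman Operator and a conditional flow matching loss weighted by Q-values, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the exact weighting scheme, so the softmax-over-batch weighting, the `temperature` parameter, the straight-line probability path, and all function names here are assumptions made purely for illustration.

```python
import numpy as np

def implicit_reward(q, v_next, gamma=0.99):
    """Inverse Bellman operator: r(s, a) = Q(s, a) - gamma * V(s').

    q and v_next are (B,) arrays; v_next stands in for the expected
    next-state value. No explicit reward model is needed.
    """
    return q - gamma * v_next

def interpolate(x0, x1, t):
    """Point on the straight-line path x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t)[:, None] * x0 + t[:, None] * x1

def weighted_cfm_loss(v_pred, x0, x1, q_values, temperature=1.0):
    """Q-weighted conditional flow matching loss (illustrative only).

    v_pred:   (B, D) velocities predicted at the interpolated points
    x0, x1:   (B, D) noise samples and dataset actions
    q_values: (B,)   action-value estimate for each dataset action
    """
    target = x1 - x0  # velocity of the straight-line path
    per_sample = np.sum((v_pred - target) ** 2, axis=1)
    # ASSUMPTION: softmax weights over the batch, so that high-Q
    # (preferred, safer) actions dominate the regression target.
    logits = (q_values - np.max(q_values)) / temperature
    weights = np.exp(logits)
    weights = weights / np.sum(weights)
    return float(np.sum(weights * per_sample))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(4, 2))          # noise samples
    x1 = rng.normal(size=(4, 2))          # dataset actions
    t = rng.uniform(size=4)               # flow times in [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = np.zeros_like(x_t)           # stand-in for a policy network
    q = rng.normal(size=4)                # stand-in for learned Q-values
    print(weighted_cfm_loss(v_pred, x0, x1, q))
```

Under this assumed weighting, a sample with a low Q-value (for example, an action drawn from a non-preferred demonstration) contributes little to the regression target, so the learned vector field is pulled toward high-value actions at training time and no guidance gradient is needed at inference.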
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 16