A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 spotlight · CC BY-SA 4.0
Keywords: Imitation learning, world models, inverse reinforcement learning
TL;DR: Rather than directly learning a policy from expert demonstrations, we instead learn world and reward models, allowing us to search at test-time and recover from mistakes.
Abstract: The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at the states the expert visited. This means that when a BC agent makes a mistake that takes it outside the support of the demonstrations, it often does not know how to recover. In this sense, BC is akin to *giving the agent a fish* -- providing dense supervision across a narrow set of states -- rather than *teaching it to fish*: enabling it to reason independently about achieving the expert's outcomes even when faced with unseen situations at test time. In response, we explore *learning to search* (L2S) from expert demonstrations, i.e., learning the components required to plan at test time to match expert outcomes, even after making a mistake. These components include *(1)* a world model and *(2)* a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach, SAILOR, consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the number of demonstrations used for BC by 5-10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at [https://github.com/arnavkj1995/SAILOR](https://github.com/arnavkj1995/SAILOR).
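To make the L2S recipe concrete, below is a minimal sketch of test-time search with a learned world model and a learned reward model, in the style of random-shooting model-predictive control. The `world_model.encode`, `world_model.step`, and `reward_model` interfaces are hypothetical placeholders, not SAILOR's actual API; see the linked repository for the real implementation.

```python
import numpy as np

def plan_with_learned_models(world_model, reward_model, obs, horizon=10,
                             num_candidates=256, action_dim=7, rng=None):
    """Test-time search (sketch): sample candidate action sequences, roll
    each one out in the learned world model, score the imagined rollouts
    with the learned reward model, and execute the first action of the
    best-scoring sequence before replanning."""
    rng = rng or np.random.default_rng()
    # Sample candidate action sequences (here: uniform in [-1, 1]).
    candidates = rng.uniform(-1.0, 1.0,
                             size=(num_candidates, horizon, action_dim))
    returns = np.zeros(num_candidates)
    for i, actions in enumerate(candidates):
        state = world_model.encode(obs)         # latent state from current observation
        for a in actions:
            state = world_model.step(state, a)  # imagined transition in latent space
            returns[i] += reward_model(state)   # learned reward; no human corrections
    best = int(np.argmax(returns))
    return candidates[best, 0]                  # execute first action, then replan
```

Because the plan is recomputed from whatever state the agent actually reaches, this kind of search can steer back toward expert outcomes after a mistake, which is exactly the recovery behavior BC alone cannot provide.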
Primary Area: Reinforcement learning (e.g., decision and control, planning, hierarchical RL, robotics)
Submission Number: 18940