Keywords: Reward Modelling, Preference Alignment, Vision-based Navigation
TL;DR: HALO learns a vision-based reward model from offline human preferences, enabling vision-based navigation that generalizes across varying scenarios and outperforms state-of-the-art methods.
Abstract: In this paper, we introduce HALO, a novel offline reward learning algorithm that distills human navigation intuition into a vision-based reward function for robot navigation. HALO learns a reward model from offline data, leveraging expert trajectories collected from mobile robots. During training, actions are randomly sampled from the action space around the expert action and ranked using a Boltzmann probability distribution that combines their distance to the expert action with human preference scores derived from intuitive navigation queries on the corresponding egocentric camera feed. These scores establish preference rankings, enabling the training of a novel reward model based on the Plackett-Luce loss, which allows for preference-driven navigation. To demonstrate the effectiveness of HALO, we deploy its reward model in two downstream applications: (i) an offline-learned policy trained directly on the HALO-derived rewards, and (ii) a model predictive control (MPC) based planner that incorporates the HALO reward as an additional cost term. This showcases the versatility of HALO across both learning-based and classical navigation frameworks. Our real-world deployments on a Clearpath Husky across multiple scenarios demonstrate that policies trained with HALO achieve improved performance over state-of-the-art methods in terms of success rate and normalized trajectory length, while maintaining a lower Fréchet distance to the human expert trajectories.
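The ranking and loss described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it shows one way to rank sampled actions with a Boltzmann-style score that combines distance to the expert action with preference scores, and to train a reward model's outputs with the Plackett-Luce loss. All function names, the `beta` temperature, and the toy data are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code): Boltzmann-score ranking of
# sampled actions plus a Plackett-Luce ranking loss over predicted rewards.
import torch


def boltzmann_ranking(expert_action, sampled_actions, preference_scores, beta=1.0):
    """Order sampled actions best-to-worst by a Boltzmann-style score that
    combines (negative) distance to the expert action with preference scores."""
    dist = torch.norm(sampled_actions - expert_action, dim=-1)   # (K,)
    score = beta * (-dist + preference_scores)                   # (K,)
    return torch.argsort(score, descending=True)                 # indices, best first


def plackett_luce_loss(predicted_rewards, ranking):
    """Negative log-likelihood of the given ranking under the Plackett-Luce
    model, with the reward model's predictions used as utilities."""
    r = predicted_rewards[ranking]                               # rewards in ranked order
    # log P(ranking) = sum_i [ r_i - logsumexp(r_i, ..., r_K) ]
    log_denoms = torch.stack([torch.logsumexp(r[i:], dim=0) for i in range(len(r))])
    return -(r - log_denoms).sum()


# Toy usage with 2-D actions and 5 sampled candidates (dummy data).
expert_action = torch.tensor([0.5, 0.1])
sampled_actions = expert_action + 0.2 * torch.randn(5, 2)
preference_scores = torch.rand(5)                        # stand-in for human preference queries
predicted_rewards = torch.randn(5, requires_grad=True)   # stand-in for reward-model outputs

ranking = boltzmann_ranking(expert_action, sampled_actions, preference_scores)
loss = plackett_luce_loss(predicted_rewards, ranking)
loss.backward()
```

In practice the predicted rewards would come from a vision-based reward network conditioned on the egocentric image, and the loss gradient would update that network rather than a free tensor as in this toy example.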
Supplementary Material: zip
Spotlight: mp4
Submission Number: 997