Abstract: Reward design in reinforcement learning and optimal control is challenging. Preference-based alignment addresses this by enabling agents to learn rewards from ranked trajectory pairs provided by humans. However, existing methods often suffer from poor robustness to unknown erroneous human preferences. In this work, we propose a robust and efficient reward alignment method based on a novel and geometrically interpretable perspective: hypothesis space batched cutting. Our method iteratively refines the reward hypothesis space through “cuts” based on batches of human preferences. Within each batch, human preferences, queried based on disagreement, are grouped using a voting function to determine the appropriate cut, ensuring bounded human query complexity. To handle unknown erroneous preferences, we introduce a conservative cutting method within each batch, preventing erroneous human preferences from making overly aggressive cuts to the hypothesis space. This guarantees provable robustness against erroneous preferences, while eliminating the need to explicitly identify them. We evaluate our method in a model predictive control setting across diverse tasks. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences.
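To make the batched-cutting idea concrete, here is a minimal sketch (not the authors' implementation) under illustrative assumptions: rewards are linear in trajectory features, the hypothesis space is a finite set of sampled weight vectors, each preference is a pair of trajectory feature vectors where the first is the one the human preferred, and the conservative cut removes a candidate only if it contradicts a chosen fraction (here `agreement_ratio`) of the batch.

```python
# Minimal sketch of hypothesis-space batched cutting with a conservative cut.
# Assumptions (ours, for illustration only): linear rewards, a finite sampled
# hypothesis set, and a fixed agreement threshold for cutting.
import numpy as np

rng = np.random.default_rng(0)

# Hypothesis space: candidate reward weights on the unit sphere (illustrative choice).
theta = rng.normal(size=(500, 3))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)

def batched_cut(theta, batch, agreement_ratio=0.7):
    """Conservatively cut the hypothesis space using one batch of preferences.

    batch: list of (phi_a, phi_b) feature pairs where the human preferred a over b.
    agreement_ratio: a candidate is removed only if it contradicts at least this
    fraction of the batch, so a few erroneous preferences cannot by themselves
    remove the true hypothesis.
    """
    votes = np.zeros(len(theta))
    for phi_a, phi_b in batch:
        # Each preference "votes" against candidates that rank b above a.
        votes += (theta @ (phi_a - phi_b) < 0).astype(float)
    keep = votes < agreement_ratio * len(batch)
    return theta[keep]

# Toy usage: a ground-truth weight generates preferences, 20% of which are flipped.
theta_true = np.array([0.6, 0.8, 0.0])
batch = []
for _ in range(10):
    phi_a, phi_b = rng.normal(size=3), rng.normal(size=3)
    if theta_true @ (phi_a - phi_b) < 0:      # order the pair so a is truly preferred
        phi_a, phi_b = phi_b, phi_a
    if rng.random() < 0.2:                    # simulate an erroneous human preference
        phi_a, phi_b = phi_b, phi_a
    batch.append((phi_a, phi_b))

theta = batched_cut(theta, batch)
print(len(theta), "candidate hypotheses remain after one conservative cut")
```

In this sketch the voting threshold plays the role of the conservative cut: an aggressive cut would set `agreement_ratio` near zero (any single preference can eliminate a candidate), whereas a value closer to one tolerates a minority of erroneous preferences in each batch.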
Lay Summary: Teaching AI agents to make good decisions—like balancing a robot or driving a car—often requires giving them a "reward function" that tells them how well they are doing. But designing this reward function by hand is difficult and laborious. A promising alternative is to learn the reward from human preference feedback. However, this approach struggles when people make mistakes in their rankings, which is common.
We developed a new method that helps machines learn from human preferences, even when some of them are wrong. Our idea is to think of all the possible reward functions as a space, and gradually narrow it down using batches of human preferences. To make this process robust, we designed a way to be conservative when human preferences are inconsistent, so that mistakes don’t lead the machine in the wrong direction.
Our approach helps machines learn more safely and effectively from humans, without requiring perfect feedback. This is especially important as AI systems are used more often in real-world settings, where human guidance is valuable but not always reliable. Our experiments show that our method not only works well when feedback is accurate, but also performs much better than existing techniques when there are lots of mistakes in the human input.
Link To Code: https://github.com/asu-iris/HSBC-Robust-Learning
Primary Area: Reinforcement Learning->Inverse
Keywords: Learning from Human Feedback, Inverse Reinforcement Learning, Preference Based Reinforcement Learning, Robust Learning
Submission Number: 13507