Inference-based Rewards for Reinforcement Learning

ICLR 2026 Conference Submission 21070 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning, Reward Inference, Vision–Language Models, Human Intent Alignment, Policy Optimization
TL;DR: We analyze when inferred reward signals can support reinforcement learning, showing that monotonicity enables stable learning and that trajectory-based methods are more robust than step-wise approaches.
Abstract: A central challenge in reinforcement learning (RL) is defining reward signals that reliably capture human values and intentions. Recent advances in vision–language models (VLMs) suggest they can serve as a powerful source of semantic rewards, offering a flexible alternative to environment-defined objectives. Unlike hand-crafted signals, VLM-based feedback can reflect high-level human goals such as safety, efficiency, and comfort. We first analyze the conditions under which VLM-based rewards enable effective learning. In particular, we highlight the importance of monotonicity with respect to true task performance and of satisfying the Markov property. When these conditions hold, VLMs provide a viable basis for reward inference. On the algorithmic side, we identify which learning strategies are best suited to such rewards. Trajectory-based methods such as policy-gradient algorithms (e.g., PPO) are naturally aligned with inferred returns, whereas Q-learning-style algorithms (e.g., DQN) are more fragile because they operate on step-wise Bellman updates and implicitly assume that rewards satisfy the Markov property. This perspective reframes RL around reward inference rather than reward specification, highlighting both the promise of VLM-based alignment and the theoretical and practical boundaries of when such methods are effective. Experiments across control domains provide supporting evidence for these insights. In particular, monotonicity appears to align with learning outcomes, PPO shows greater robustness than DQN when trained with inferred rewards, and natural language prompts can guide the emergence of instruction-driven behaviors.
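To make the trajectory-level argument concrete, below is a minimal, hypothetical sketch (not from the submission) of why policy-gradient methods pair naturally with inferred returns: a REINFORCE-style update only needs one scalar score per trajectory, so a judge that scores whole episodes (here a stand-in function `inferred_return` on a toy chain MDP; all names, constants, and the environment are illustrative assumptions) can drive learning without a Markov per-step reward.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 20

def rollout(theta):
    """Run one episode of a toy chain MDP; action 1 moves right, action 0 moves left."""
    s, states, actions = 0, [], []
    for _ in range(HORIZON):
        logits = theta[s]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(N_ACTIONS, p=probs)
        states.append(s)
        actions.append(a)
        s = min(N_STATES - 1, s + 1) if a == 1 else max(0, s - 1)
    return states, actions, s

def inferred_return(final_state):
    """Hypothetical stand-in for a VLM judge: one trajectory-level score.
    It is monotone in true progress (how far right the agent ends up),
    which is the condition the abstract identifies as enabling stable learning."""
    return float(final_state) / (N_STATES - 1)

def reinforce(iterations=200, lr=0.5):
    """REINFORCE with a single inferred return per trajectory (no per-step reward)."""
    theta = np.zeros((N_STATES, N_ACTIONS))
    baseline = 0.0
    for _ in range(iterations):
        states, actions, final_state = rollout(theta)
        G = inferred_return(final_state)        # one scalar for the whole trajectory
        baseline = 0.9 * baseline + 0.1 * G     # moving-average baseline to reduce variance
        adv = G - baseline
        for s, a in zip(states, actions):
            logits = theta[s]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            grad = -probs
            grad[a] += 1.0                      # gradient of log pi(a|s) w.r.t. logits
            theta[s] += lr * adv * grad
    return theta

theta = reinforce()
print("Prob. of moving right in each state:",
      np.round(np.exp(theta[:, 1]) / np.exp(theta).sum(axis=1), 2))
```

A step-wise Bellman update, by contrast, would need a reward defined at every transition; if the inferred score is only available (or only meaningful) at the trajectory level, that assumption breaks, which is the fragility the abstract attributes to DQN-style methods.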
Primary Area: reinforcement learning
Submission Number: 21070