Intrinsic Credit Assignment for Long Horizon Interaction

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · LLA 2026 Poster · CC BY 4.0
Keywords: Reinforcement Learning with Verifiable Rewards (RLVR), Long-Horizon Decision Making, Intrinsic Beliefs, Multi-Turn Information Seeking
TL;DR: We show that intrinsic beliefs provide a scalable training signal for improved credit assignment, and in turn promote more effective information seeking in long-horizon interactions.
Abstract: How can we train agents to navigate uncertainty over long horizons? In this work, we propose ∆Belief-RL, which leverages a language model's own intrinsic beliefs to reward intermediate progress. Our method uses the change in the probability an agent assigns to the target solution for credit assignment. By training on synthetic interaction data, ∆Belief-RL teaches information-seeking capabilities that consistently outperform purely outcome-based rewards for RL, with improvements generalizing to out-of-distribution applications ranging from customer service to personalization. Notably, performance continues to improve as we scale test-time interactions beyond the training horizon, with interaction efficiency increasing even on Pass@k metrics. Overall, our work introduces a scalable training strategy for navigating uncertainty over long horizons, by enabling credit assignment to intermediate actions via intrinsic ∆Belief rewards.
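The core reward described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes we can read off the agent's belief, i.e. the probability it assigns to the target solution, after each turn, and it credits each turn with the resulting change in that belief. All names are illustrative.

```python
def delta_belief_rewards(beliefs):
    """Given beliefs[t] = P(target solution | interaction up to turn t),
    return per-turn intrinsic rewards r_t = beliefs[t+1] - beliefs[t],
    so that each turn is credited with the belief gain it produced."""
    return [b1 - b0 for b0, b1 in zip(beliefs, beliefs[1:])]

# Example: belief in the target rises from 0.0 to 1.0 over three turns.
rewards = delta_belief_rewards([0.0, 0.5, 1.0])
```

A convenient property of this reward is that it telescopes: the per-turn rewards sum to the total belief change over the episode, so dense intermediate credit stays consistent with the final outcome.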
Submission Number: 110