Keywords: Large Language Models; Agent Reasoning; Reward Modelling
Abstract: Large Language Models (LLMs) can operate as agents that interleave reasoning, action, and observation. Training them with reinforcement learning (RL) in multi-turn scenarios remains challenging: long horizons and sparse terminal rewards provide little guidance for intermediate states, leading to unreliable credit assignment and diluted token-level updates. We address these limitations with *RewardFlow*, a graph-based reward-modeling framework that represents agentic contexts as graphs, with states as nodes and actions as edges. RewardFlow constructs a state graph from multiple rollouts and propagates terminal rewards from successful states to all visited states using graph propagation methods such as Breadth-First Search and Personalized PageRank. This yields dense, state-wise, task-centric reward signals that indicate whether an action moves the agent closer to or farther from success. Across text and visual domains, four challenging agent environments, and three model sizes, RewardFlow consistently improves task success and training efficiency over strong group-based RL baselines. These results show that RewardFlow is a simple, scalable, and effective framework for mitigating sparse-reward credit assignment in agentic RL.
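The abstract does not give implementation details, but the BFS variant of the propagation step can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: states are hashable ids, each rollout carries a scalar terminal reward, and propagated rewards decay geometrically per hop (the `decay` parameter and the `propagate_rewards` helper are hypothetical names, not from the paper).

```python
from collections import defaultdict, deque

def propagate_rewards(rollouts, decay=0.9):
    """Propagate terminal rewards backwards over a state graph built from rollouts.

    rollouts: list of (trajectory, terminal_reward) pairs, where trajectory
    is an ordered list of hashable state ids. States that can reach a
    successful terminal state receive a per-hop discounted reward.
    """
    # Build a reversed adjacency map: an edge s -> s' observed in a rollout
    # becomes s' -> s, so BFS from terminal states walks backwards in time.
    preds = defaultdict(set)
    terminal_reward = {}
    for traj, r in rollouts:
        for s, s_next in zip(traj, traj[1:]):
            preds[s_next].add(s)
        # Keep the best observed outcome per terminal state.
        terminal_reward[traj[-1]] = max(terminal_reward.get(traj[-1], 0.0), r)

    reward = defaultdict(float)
    for term, r in terminal_reward.items():
        if r <= 0:
            continue  # only successful outcomes are propagated
        # BFS from each rewarded terminal state, discounting by hop distance.
        seen = {term}
        queue = deque([(term, r)])
        while queue:
            s, val = queue.popleft()
            reward[s] = max(reward[s], val)
            for p in preds[s]:
                if p not in seen:
                    seen.add(p)
                    queue.append((p, val * decay))
    return dict(reward)
```

The Personalized PageRank variant mentioned in the abstract would replace the BFS with a random-walk stationary distribution personalized on the successful states; the backwards-edge construction would stay the same.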
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23102