Keywords: Large Language Models; Agent Reasoning; Reward Modelling
Abstract: Large Language Models (LLMs) can operate as agents that interleave reasoning, action, and observation. Training them with reinforcement learning (RL) in multi-turn scenarios remains challenging: long horizons and sparse terminal rewards provide little guidance for intermediate states, leading to unreliable credit assignment and diluted token-level updates. We address these limitations with *RewardFlow*, a graph-based reward-modeling framework that represents agentic contexts as graphs, with states as nodes and actions as edges. RewardFlow constructs a state graph from multiple rollouts and propagates terminal rewards from successful states to all visited states using graph propagation methods such as Breadth-First Search and Personalized PageRank. This yields dense, state-wise, task-centric reward signals that indicate whether an action moves the agent closer to or farther from success. Across text and visual domains, four challenging agent environments, and three model sizes, RewardFlow consistently improves task success and training efficiency over strong group-based RL baselines. These results show that RewardFlow is a simple, scalable, and effective framework for mitigating sparse-reward credit assignment in agentic RL.
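The abstract does not give implementation details, but the BFS variant of the propagation step can be sketched as follows. This is a minimal illustration under assumptions not stated in the abstract: states are hashable ids, each rollout carries a scalar terminal reward, and propagated rewards decay geometrically per hop (the `decay` parameter and the `propagate_rewards` helper are hypothetical names, not from the paper).

```python
from collections import defaultdict, deque

def propagate_rewards(rollouts, decay=0.9):
    """Propagate terminal rewards backwards over a state graph built from rollouts.

    rollouts: list of (trajectory, terminal_reward) pairs, where trajectory
    is an ordered list of hashable state ids. States that can reach a
    successful terminal state receive a per-hop discounted reward.
    """
    # Build a reversed adjacency map: an edge s -> s' observed in a rollout
    # becomes s' -> s, so BFS from terminal states walks backwards in time.
    preds = defaultdict(set)
    terminal_reward = {}
    for traj, r in rollouts:
        for s, s_next in zip(traj, traj[1:]):
            preds[s_next].add(s)
        # Keep the best observed outcome per terminal state.
        terminal_reward[traj[-1]] = max(terminal_reward.get(traj[-1], 0.0), r)

    reward = defaultdict(float)
    for term, r in terminal_reward.items():
        if r <= 0:
            continue  # only successful outcomes are propagated
        # BFS from each rewarded terminal state, discounting by hop distance.
        seen = {term}
        queue = deque([(term, r)])
        while queue:
            s, val = queue.popleft()
            reward[s] = max(reward[s], val)
            for p in preds[s]:
                if p not in seen:
                    seen.add(p)
                    queue.append((p, val * decay))
    return dict(reward)
```

The Personalized PageRank variant mentioned in the abstract would replace the BFS with a random-walk stationary distribution personalized on the successful states; the backwards-edge construction would stay the same.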
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23102