AgentPRM: Scalable Process Reward Models for Language Model Agents via Step-Wise Promise and Progress
Abstract: Despite recent advancements, large language models (LLMs) still face challenges in multi-turn decision-making tasks (i.e., agent tasks), where a model must make a sequence of intelligent decisions based on environment feedback. In this work, we explore training process reward models (PRMs) to evaluate each decision and guide the model's search in agent tasks. Unlike LLM reasoning, where each step is scored for correctness, actions in agent tasks have no clear-cut correctness; instead, they should be evaluated by their proximity to the goal and the progress they make. Building on this insight, we redefine the PRM for agent tasks and introduce AgentPRM to capture both the interdependence between sequential decisions and their contribution to the final goal, enabling better progress tracking and a better exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a TD-based estimation method combined with Generalized Advantage Estimation (GAE). Experiments across sampling strategies, models, and tasks demonstrate that our method consistently outperforms baselines, is more compute-efficient, and exhibits a more stable and robust improvement trend as inference compute scales. Furthermore, it generalizes well to mathematical reasoning.
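The abstract mentions obtaining step-level labels via TD-based estimation combined with GAE. As a rough illustration only (not the authors' released code), the sketch below shows standard GAE computed from TD errors over a trajectory; the function name `gae_step_labels`, the example numbers, and the default values of `gamma` and `lam` are assumptions for demonstration.

```python
# Hypothetical sketch: step-level labels from TD errors + Generalized Advantage
# Estimation (GAE). This illustrates the standard GAE recursion, not the paper's
# exact labeling pipeline.
from typing import List

def gae_step_labels(
    rewards: List[float],   # per-step rewards; often sparse (outcome reward at the end)
    values: List[float],    # value estimates V(s_0), ..., V(s_T) including a bootstrap value
    gamma: float = 0.99,    # discount factor (assumed hyperparameter)
    lam: float = 0.95,      # GAE lambda (assumed hyperparameter)
) -> List[float]:
    """Return one advantage-based label per action, computed backwards in time."""
    T = len(rewards)
    assert len(values) == T + 1, "values must include an estimate for the final state"
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # TD error: immediate reward plus discounted next-state value minus current value
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors (GAE recursion)
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-step trajectory rewarded only at the end
print(gae_step_labels([0.0, 0.0, 1.0], [0.2, 0.4, 0.6, 0.0]))
```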
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM-based agents, decision-making, process reward models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 7403