AgentPRM: Scalable Process Reward Models for Language Model Agents via Step-Wise Promise and Progress
Abstract: Despite recent advancements, large language models (LLMs) still face challenges in multi-turn decision-making tasks (i.e., agent tasks), where a model must make a sequence of intelligent decisions based on environment feedback. In this work, we explore training process reward models (PRMs) to evaluate each decision and guide the model's search in agent tasks. Unlike LLM reasoning, where each step is scored for correctness, actions in agent tasks have no clear-cut correctness; instead, they should be evaluated by their proximity to the goal and the progress they make. Building on this insight, we redefine the PRM for agent tasks and introduce AgentPRM to capture both the interdependence between sequential decisions and their contribution to the final goal, enabling better progress tracking and a better exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a TD-based estimation method combined with Generalized Advantage Estimation (GAE). Experiments across sampling strategies, models, and tasks demonstrate that our method consistently outperforms baselines, is more compute-efficient, and exhibits a more stable and robust improvement trend as inference compute scales. Furthermore, it generalizes well to mathematical reasoning.
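The abstract mentions obtaining step-level labels via TD-based estimation combined with GAE. As a rough illustration only (not the authors' released code), the sketch below shows standard GAE computed from TD errors over a trajectory; the function name `gae_step_labels`, the example numbers, and the default values of `gamma` and `lam` are assumptions for demonstration.

```python
# Hypothetical sketch: step-level labels from TD errors + Generalized Advantage
# Estimation (GAE). This illustrates the standard GAE recursion, not the paper's
# exact labeling pipeline.
from typing import List

def gae_step_labels(
    rewards: List[float],   # per-step rewards; often sparse (outcome reward at the end)
    values: List[float],    # value estimates V(s_0), ..., V(s_T) including a bootstrap value
    gamma: float = 0.99,    # discount factor (assumed hyperparameter)
    lam: float = 0.95,      # GAE lambda (assumed hyperparameter)
) -> List[float]:
    """Return one advantage-based label per action, computed backwards in time."""
    T = len(rewards)
    assert len(values) == T + 1, "values must include an estimate for the final state"
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # TD error: immediate reward plus discounted next-state value minus current value
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors (GAE recursion)
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-step trajectory rewarded only at the end
print(gae_step_labels([0.0, 0.0, 1.0], [0.2, 0.4, 0.6, 0.0]))
```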
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: LLM-based agents, decision-making, process reward models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 7403