Abstract: Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model with more diverse tasks to improve generalizability. In this work,
we find that fine-tuning a reward model to guide the policy model is more robust than directly fine-tuning the policy model.
Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search.
We comprehensively investigate three approaches to constructing the reward model: explicit reward modeling, implicit reward modeling, and LLM-as-a-judge.
We then use AgentRM to guide the answer generation with Best-of-N sampling and step-level beam search.
On nine agent tasks, AgentRM enhances the base policy model by $8.8$ points on average, surpassing the top general agent by $4.0$ points.
Moreover, with a specialized policy model, AgentRM outperforms the top specialized agent by $11.4$ points on held-in tasks.
All data and source code will be released to facilitate research in this area.
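To make the test-time search concrete, below is a minimal sketch of how a reward model can guide Best-of-N sampling and step-level beam search, as mentioned in the abstract. The callables `generate`, `propose`, `score`, and `is_done` are hypothetical stand-ins for the policy model and AgentRM, not the released API.

```python
# Minimal sketch of reward-model-guided test-time search.
# `generate`, `propose`, `score`, and `is_done` are hypothetical placeholders.
from typing import Callable, List


def best_of_n(
    task: str,
    generate: Callable[[str], str],          # policy: task -> full candidate trajectory
    score: Callable[[str, str], float],      # reward model: (task, trajectory) -> scalar
    n: int = 8,
) -> str:
    """Sample n candidate trajectories and return the one the reward model prefers."""
    candidates: List[str] = [generate(task) for _ in range(n)]
    return max(candidates, key=lambda c: score(task, c))


def step_beam_search(
    task: str,
    propose: Callable[[str, List[str]], List[str]],   # policy: (task, partial trajectory) -> next-step candidates
    score: Callable[[str, List[str]], float],         # reward model: (task, partial trajectory) -> scalar
    is_done: Callable[[List[str]], bool],              # whether a trajectory is complete
    beam_width: int = 4,
    max_steps: int = 10,
) -> List[str]:
    """Keep the beam_width highest-scoring (partial) trajectories after every step."""
    beams: List[List[str]] = [[]]
    for _ in range(max_steps):
        # Expand unfinished beams with candidate next steps; carry finished ones forward.
        expansions = [b + [a] for b in beams if not is_done(b) for a in propose(task, b)]
        finished = [b for b in beams if is_done(b)]
        pool = expansions + finished
        if not pool:
            break
        pool.sort(key=lambda b: score(task, b), reverse=True)
        beams = pool[:beam_width]
        if all(is_done(b) for b in beams):
            break
    return max(beams, key=lambda b: score(task, b))
```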
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: applications
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8043