PARM: Pipeline-Adapted Reward Model

ACL ARR 2025 February Submission 7590 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Recently, reward models have received increasing attention for their significant progress in improving the decoding quality of large language models and in guiding reinforcement fine-tuning. Existing research has focused on reward models for individual models; however, as the capabilities of large language models continue to grow, they are increasingly deployed as components of process or pipeline tasks, and reward models for such pipeline tasks remain unexplored. To bridge this gap, we build a pipeline around the task of code generation for optimization problems and verify the potential of a reward model to improve the quality of pipeline outputs. Furthermore, to address the problems that arise when a reward model is used to improve pipeline output quality, we propose a simple and efficient training method for a Pipeline-Adapted Reward Model (PARM) that further improves its effectiveness. Through performance experiments on four benchmarks and a series of analysis experiments, we validate the effectiveness of PARM and obtain important insights into this topic.
Paper Type: Long
Research Area: Generation
Research Area Keywords: LLM, pipeline, mathematical reasoning, code generation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 7590