Keywords: Collaborative Reward Modeling; Reinforcement Learning from Human Feedback; Multi-Agent Systems; Reasoning Alignment
Abstract: We present CRM, a multi-agent collaborative reward modeling framework that replaces a single black-box reward model with a coordinated set of specialized evaluators to improve robustness and interpretability in reinforcement learning from human feedback (RLHF). Conventional reward models struggle to simultaneously capture multiple, often competing, preference dimensions (e.g., factual correctness, helpfulness, and safety) and provide limited insight into the source of their scores. CRM addresses these limitations by decomposing preference evaluation into domain-specific reward agents, complemented by global signals such as ranker-based preferences and embedding-based semantic similarity. A centralized aggregation mechanism fuses these heterogeneous signals into a single scalar reward compatible with standard policy optimization, balancing step-wise correctness, signal-level consistency, and repetition penalties. Experiments on RewardBench and on reasoning benchmarks such as GSM8K show that CRM improves reasoning accuracy and training stability while preserving dialogue quality and safety.
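The aggregation step described in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical rendering, not the paper's exact formulation: the specific weights, the standard-deviation-based consistency term, and the linear repetition penalty are all illustrative assumptions about how per-agent scores, a ranker preference, and an embedding similarity might be fused into one scalar reward.

```python
# Hypothetical sketch of a CRM-style centralized aggregation step.
# Weights, signal names, and penalty forms are illustrative assumptions.
from statistics import pstdev

def aggregate_reward(agent_scores, ranker_pref, semantic_sim,
                     repetition_rate, weights=(0.5, 0.3, 0.2),
                     consistency_coef=0.1, repetition_coef=0.5):
    """Fuse heterogeneous preference signals into a single scalar reward.

    agent_scores:    per-dimension scores from domain-specific reward
                     agents (e.g. factuality, helpfulness, safety), in [0, 1]
    ranker_pref:     global ranker-based preference score, in [0, 1]
    semantic_sim:    embedding-based semantic similarity, in [0, 1]
    repetition_rate: fraction of repeated n-grams in the response
    """
    w_agents, w_rank, w_sim = weights
    mean_agent = sum(agent_scores) / len(agent_scores)
    # Signal-level consistency: penalize disagreement among the agents,
    # here approximated by the population std. dev. of their scores.
    consistency_penalty = consistency_coef * pstdev(agent_scores)
    base = (w_agents * mean_agent
            + w_rank * ranker_pref
            + w_sim * semantic_sim)
    # Repetition penalty discourages degenerate, repetitive outputs.
    return base - consistency_penalty - repetition_coef * repetition_rate
```

Because the output is a single scalar, it plugs directly into standard policy-optimization objectives (e.g. PPO) without changing the training loop, which is the compatibility property the abstract emphasizes.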
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling: safety and alignment; LLM/AI agents; robustness; Interpretability and Analysis of Models for NLP: robustness; human-subject application-grounded evaluations
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 829