Keywords: Collaborative Reward Modeling; Reinforcement Learning from Human Feedback; Multi-Agent Systems; Reasoning Alignment
Abstract: We present CRM, a multi-agent collaborative reward modeling framework that replaces a single black-box reward model with a coordinated set of specialized evaluators to improve robustness and interpretability in reinforcement learning from human feedback (RLHF). Conventional reward models struggle to simultaneously capture multiple, often competing, preference dimensions (e.g., factual correctness, helpfulness, and safety) and provide limited insight into the source of their scores. CRM addresses these limitations by decomposing preference evaluation into domain-specific reward agents, complemented by global signals such as ranker-based preferences and embedding-based semantic similarity. A centralized aggregation mechanism fuses these heterogeneous signals into a single scalar reward compatible with standard policy optimization, balancing step-wise correctness, signal-level consistency, and repetition penalties. Experiments on RewardBench and on reasoning benchmarks such as GSM8K show that CRM improves reasoning accuracy and training stability while preserving dialogue quality and safety.
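The aggregation step described in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical rendering, not the paper's exact formulation: the specific weights, the standard-deviation-based consistency term, and the linear repetition penalty are all illustrative assumptions about how per-agent scores, a ranker preference, and an embedding similarity might be fused into one scalar reward.

```python
# Hypothetical sketch of a CRM-style centralized aggregation step.
# Weights, signal names, and penalty forms are illustrative assumptions.
from statistics import pstdev

def aggregate_reward(agent_scores, ranker_pref, semantic_sim,
                     repetition_rate, weights=(0.5, 0.3, 0.2),
                     consistency_coef=0.1, repetition_coef=0.5):
    """Fuse heterogeneous preference signals into a single scalar reward.

    agent_scores:    per-dimension scores from domain-specific reward
                     agents (e.g. factuality, helpfulness, safety), in [0, 1]
    ranker_pref:     global ranker-based preference score, in [0, 1]
    semantic_sim:    embedding-based semantic similarity, in [0, 1]
    repetition_rate: fraction of repeated n-grams in the response
    """
    w_agents, w_rank, w_sim = weights
    mean_agent = sum(agent_scores) / len(agent_scores)
    # Signal-level consistency: penalize disagreement among the agents,
    # here approximated by the population std. dev. of their scores.
    consistency_penalty = consistency_coef * pstdev(agent_scores)
    base = (w_agents * mean_agent
            + w_rank * ranker_pref
            + w_sim * semantic_sim)
    # Repetition penalty discourages degenerate, repetitive outputs.
    return base - consistency_penalty - repetition_coef * repetition_rate
```

Because the output is a single scalar, it plugs directly into standard policy-optimization objectives (e.g. PPO) without changing the training loop, which is the compatibility property the abstract emphasizes.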
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling: safety and alignment; LLM/AI agents; robustness; Interpretability and Analysis of Models for NLP: robustness; human-subject application-grounded evaluations
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 829