Keywords: reinforcement learning, large language model, agent
Abstract: Large language model alignment via reinforcement learning depends critically on reward function quality. However, generic reward models often underperform on heterogeneous task distributions due to distribution shifts, while training task‑specific reward models is costly and hampered by annotation difficulty, catastrophic forgetting, and loss of generalization.
We present RLVR (Reinforcement Learning from Agent Rewards), a unified, agent‑driven framework that dynamically assigns tailored reward functions to individual training queries. RLVR combines two automated LLM‑based stages: first, a tool‑generation stage in which web agents and code agents generate rule‑, metric‑, and model‑based reward functions and wrap each as a callable tool; second, a reward tool‑calling stage in which a central decision LLM assigns these reward‑function tools to individual queries.
Across diverse tasks including translation, summarization, question answering, and mathematics, RLVR delivers a $5$–$10$% average improvement over a widely‑used generic reward model (Skywork‑Reward‑V2) and matches GPT‑4.1‑as‑judge performance, while generalizing well to held‑out benchmarks such as BenchMAX, AIME-2024, and Arena-Hard-v2.
Ablation studies show performance drops of $40$%, $77$%, and $198$% when removing the web agent, code agent, and selection backbone, respectively; the selection backbone achieves $86.50$% selection accuracy, near the theoretical ceiling set by top reward models. The retrieval module locates optimal tools reliably, with an average first‑page rank of $5.64$.
By systematically leveraging and extending existing reward sources, RLVR offers a scalable path to high‑quality RL alignment over heterogeneous task domains.
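To make the two-stage design concrete, the following is a minimal sketch of how generated reward functions could be wrapped as callable tools and routed per query. All names here (RewardTool, select_tool, the example reward functions, and the keyword heuristic standing in for the decision LLM) are hypothetical illustrations, not the paper's actual API.

```python
# Minimal sketch of the two-stage RLVR pipeline described in the abstract.
# All identifiers are hypothetical; the real system uses LLM-based agents.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RewardTool:
    """A reward function wrapped as a callable tool with metadata."""
    name: str
    kind: str                            # "rule", "metric", or "model"
    score: Callable[[str, str], float]   # (query, response) -> reward


def exact_match_reward(query: str, response: str) -> float:
    # Example rule-based reward a code agent might produce for math tasks:
    # extract the final answer and compare it to a reference (omitted here).
    return 1.0 if response.strip() else 0.0


def length_penalty_reward(query: str, response: str) -> float:
    # Example metric-based reward: penalize overly long summaries.
    return max(0.0, 1.0 - len(response.split()) / 200.0)


# Stage 1 (tool generation): web agents and code agents synthesize
# rule-, metric-, and model-based reward functions and register them.
TOOL_REGISTRY: Dict[str, RewardTool] = {
    "math_exact_match": RewardTool("math_exact_match", "rule", exact_match_reward),
    "summary_length": RewardTool("summary_length", "metric", length_penalty_reward),
}


def select_tool(query: str, tools: Dict[str, RewardTool]) -> RewardTool:
    # Stage 2 (reward tool calling): a central decision LLM picks the
    # best-suited tool per query; a keyword heuristic stands in for it here.
    if any(tok in query.lower() for tok in ("solve", "compute", "integral")):
        return tools["math_exact_match"]
    return tools["summary_length"]


def reward(query: str, response: str) -> float:
    """Per-query reward used inside the RL training loop."""
    return select_tool(query, TOOL_REGISTRY).score(query, response)


if __name__ == "__main__":
    q = "Solve 3x + 5 = 20 for x."
    print(reward(q, "x = 5"))
```

In this sketch the query-to-tool routing decision is the piece the paper's central decision LLM would make; everything else (the registry and the individual reward functions) corresponds to the output of the automated tool-generation stage.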
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24697