JoyAgents-R1: Accelerating Multi-Agent Evolution Dynamics with Variance-Reduction Group Relative Policy Optimization

ICLR 2026 Conference Submission 16622 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multi-Agent Systems, Joint Evolution Dynamics, Group Relative Policy Optimization
Abstract: Large Language Model (LLM)-based multi-agent systems represent a promising paradigm with broad applicability, exemplified by general-purpose Artificial Intelligence (AI) assistants capable of performing multiple tasks. Nevertheless, joint optimization across functionally distinct agents remains challenging due to divergent working modes and reward functions. To address this issue, we introduce JoyAgents-R1, a framework that accelerates multi-agent evolution with a novel Variance-Reduction Group Relative Policy Optimization (VR-GRPO), integrating efficient sampling and update strategies. Specifically, VR-GRPO performs Monte Carlo sampling based on an initial reasoning trajectory to avoid the exponential explosion of the joint action space while maintaining policy diversity. Then, the method selects the top-$K$ sampling groups with maximal reward fluctuations based on the marginal benefit principle, thereby enabling cost-effective parameter updates. To further complement evolution, an adaptive memory evolution mechanism that repurposes GRPO rewards as cost-free supervisory signals is designed to eliminate repetitive reasoning and accelerate convergence. Experiments on multi-task AI assistant datasets across both general and e-commerce scenarios demonstrate that JoyAgents-R1, built upon smaller 3B/7B open-source models, achieves performance comparable to that of larger LLMs, such as DeepSeek-R1, and surpasses DeepSeek-V3 by an average of 6\%.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16622