Keywords: large language models, policy optimization, multi-domain reasoning
TL;DR: We propose CGPO, a scalable, curvature-guided RL method that leverages cross-domain gradient interactions to enhance multi-domain reasoning in LLMs, achieving faster and more consistent improvements across diverse tasks.
Abstract: Multi-domain reinforcement learning (RL) for large language models (LLMs) involves highly intricate reward surfaces, posing significant challenges in finding parameters that excel across all domains. Recent empirical studies have further highlighted conflicts among domains, where gains in one capability often come at the expense of another. However, approaches to mitigate such conflicts and enhance multi-domain reasoning remain largely underexplored. To address this challenge, we propose **C**urvature-**G**uided **P**olicy **O**ptimization (**CGPO**), a principled and scalable training framework to advance the multi-domain reasoning of LLMs. Inspired by Newton's method, CGPO exploits the geometric structure of the reward surface while sidestepping the prohibitive cost of Hessian computation. At each update, CGPO processes domains in random order, preconditioning their gradients with curvature information from other domains to foster richer cross-domain interactions. This mechanism further promotes implicit gradient alignment by maximizing inter-domain inner products in expectation, steering the parameters toward regions that jointly enhance multi-domain performance. Extensive experiments on a mixed dataset covering math, coding, science, and creative writing, evaluated across seven widely used benchmarks, show that CGPO significantly outperforms all baselines, delivering faster reward improvement and stronger multi-domain capability.
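The following is a minimal sketch of one plausible reading of the update scheme described in the abstract: domains are visited in random order, and each domain's gradient is taken at parameters already shifted by the preceding domains, so that a second-order expansion introduces cross-domain Hessian-gradient (curvature) terms and, in expectation, inner-product alignment. The toy quadratic losses, the `inner_lr` hyperparameter, and the `cgpo_step` helper are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a CGPO-style sequential multi-domain update (assumptions noted above).
import random
import torch

def cgpo_step(params, domain_loss_fns, inner_lr=1e-2):
    """One update: shuffle the domains, then take each domain's gradient at
    parameters already displaced by the previous domains' steps. The cross-domain
    curvature coupling described in the abstract arises implicitly from this
    sequential evaluation; the exact preconditioner used by CGPO is not shown here."""
    order = list(domain_loss_fns)
    random.shuffle(order)                          # random domain order each update
    start = params.detach().clone()
    theta = params.detach().clone()
    for name in order:
        theta.requires_grad_(True)
        loss = domain_loss_fns[name](theta)
        (grad,) = torch.autograd.grad(loss, theta)
        theta = (theta - inner_lr * grad).detach() # sequential inner step per domain
    with torch.no_grad():
        params.copy_(theta)                        # apply the accumulated displacement
    return (theta - start).norm().item()

# Toy usage: two "domains" with conflicting quadratic objectives.
params = torch.zeros(4)
domains = {
    "math": lambda p: ((p - torch.tensor([1., 0., 0., 0.])) ** 2).sum(),
    "code": lambda p: ((p - torch.tensor([0., 1., 0., 0.])) ** 2).sum(),
}
for _ in range(100):
    cgpo_step(params, domains)
print(params)  # settles between the two domain optima
```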
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19328