Keywords: large language models, policy optimization, gradient conflicts
TL;DR: We propose CAPO, a scalable policy optimization method for LLMs that dynamically aggregates gradients via a conflict-aware SOCP, reducing interference among sampled responses and improving multi-response training performance.
Abstract: Recent advances in policy optimization have greatly improved the reasoning abilities of large language models (LLMs). A pivotal breakthrough lies in sampling a group of responses for each query and adjusting their likelihoods according to the advantages of their scores relative to the group mean. However, substantial conflicts may arise between the aggregated gradient and the individual gradients of the responses, diminishing the effectiveness of the gradient signal and ultimately hindering training performance. To address this challenge, we propose **C**onflict-**A**ware **P**olicy **O**ptimization (**CAPO**), a novel and scalable training method that mitigates these conflicts through dynamic gradient aggregation. Specifically, CAPO formulates the gradient aggregation step as a *second-order cone program (SOCP)* that seeks a gradient direction maximizing alignment with positive-advantage responses while enforcing constraints that suppress negative-advantage responses. To make the SOCP scalable and tractable for LLMs, we significantly reduce the number of variables via Lagrangian duality and compress the gradient dimension using the Johnson-Lindenstrauss transform. We further show that the dynamic gradient aggregation effectively reduces conflicts without sacrificing convergence. Experiments on several widely used mathematical reasoning datasets and benchmarks with Qwen2.5-1.5B and Qwen2.5-3B show that CAPO consistently outperforms the baselines in accuracy.
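A minimal sketch of the conflict-aware aggregation step described above, with notation introduced here only for illustration (per-response gradients $g_i$, advantages $A_i$, aggregated direction $d$); the exact objective and constraints used by CAPO may differ:

$$
\max_{d}\ \sum_{i:\,A_i>0} A_i\,\langle g_i, d\rangle
\quad \text{s.t.}\quad
\langle g_j, d\rangle \le 0\ \ \forall j:\,A_j<0,
\qquad \lVert d\rVert_2 \le 1.
$$

This is a linear objective with linear constraints and a single second-order cone constraint, i.e., an SOCP; per the abstract, the $g_i$ are first compressed with a Johnson-Lindenstrauss random projection so the program stays low-dimensional.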
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19135