Keywords: Alignment, Preference Learning, Adversarial Learning, Instruction Following
Abstract: Shaping the behavior of powerful Large Language Models (LLMs) to be both beneficial and safe is the central challenge of modern AI alignment. We posit that the post-training alignment process is fundamentally a unified challenge of Preference Learning, encompassing two distinct modalities: learning from demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and from comparative preferences (e.g., Reinforcement Learning, RL). The current industry-standard pipeline, which processes these preference types sequentially, is inherently flawed due to a critical distributional mismatch between the static expert data and the dynamic policy. This creates two interconnected problems: (1) Offline SFT trains on a fixed expert distribution, but as the policy's own generation distribution drifts, the learned knowledge becomes brittle and unreliable. (2) Subsequent online RL explores to improve generalization, but it operates without direct access to the rich, ground-truth knowledge within the expert demonstrations, making its exploration inefficient and ungrounded. This fundamental separation prevents the two data sources from synergistically regularizing each other. To resolve this, we first reframe alignment as a constrained optimization problem. We then propose Unified Adversarial Preference Learning (UniAPL), a novel framework that directly operationalizes this theory by dynamically bridging the gap between the policy's distribution and the expert's distribution. The ultimate expression of our framework is a simplified, single-stage unified training objective. This approach cohesively learns from mixed batches of SFT and preference feedback data, allowing the dense expert data to directly ground and regularize the online exploration process in every gradient update. 
This concurrent optimization inherently mitigates the distributional mismatch and maximizes data synergy. We empirically validate our approach on instruction-following tasks using Qwen3-235B-Instruct-2507 as the expert teacher. Our models demonstrate comparable or superior general capabilities in English, coding, mathematics, and Chinese, while significantly enhancing instruction-following ability: UniAPL surpasses the strong GRPO baseline by 5.77\% on Qwen3-0.6B, matching the performance of a 4B model, and by 3.75\% on Qwen3-4B, even outperforming the teacher model. Furthermore, analysis of response length and log-probability (logp) distributions shows that models trained with UniAPL not only achieve stronger performance but also generate outputs closely resembling expert demonstrations.
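To make the single-stage objective concrete, the sketch below shows one way a mixed batch could combine a demonstrated-preference (SFT) term with a comparative-preference term in a single loss per gradient step. This is an illustrative assumption, not the paper's actual formulation: the DPO-style logistic loss, the `beta` temperature, and the mixing weight `lam` are all hypothetical stand-ins for whatever UniAPL uses.

```python
# Hypothetical sketch of a single-stage unified preference objective.
# Assumptions (not from the paper): a DPO-style logistic comparative loss,
# a temperature `beta`, and a linear mixing weight `lam`.
import math

def sft_loss(expert_logps):
    # Demonstrated preferences: mean negative log-likelihood of expert tokens.
    return -sum(expert_logps) / len(expert_logps)

def pref_loss(logp_chosen, logp_rejected, beta=0.1):
    # Comparative preferences: logistic loss on the scaled log-prob margin,
    # i.e. -log sigmoid(beta * (logp_chosen - logp_rejected)).
    margin = beta * (logp_chosen - logp_rejected)
    return math.log(1.0 + math.exp(-margin))

def unified_loss(batch, lam=1.0):
    # One gradient step sees a mixed batch, so the dense expert term and the
    # comparative term regularize each other in every update.
    l_sft = sft_loss(batch["expert_logps"])
    l_pref = sum(pref_loss(c, r) for c, r in batch["pairs"]) / len(batch["pairs"])
    return l_sft + lam * l_pref

batch = {
    "expert_logps": [-0.5, -1.2, -0.8],          # per-token logps on a demo
    "pairs": [(-1.0, -3.0), (-0.7, -2.5)],       # (chosen, rejected) logps
}
print(unified_loss(batch))
```

The key property the abstract emphasizes is that both terms appear in the same scalar loss, so every update is simultaneously grounded by expert demonstrations and shaped by comparative feedback, rather than alternating between two training stages.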
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10652