A Fine-Grained Analysis of Pure Semantic Preference Alignment in Large Language Models

20 Sept 2025 (modified: 15 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Preference Alignment, Human Feedback
Abstract: Large language models (LLMs) are typically aligned with human preferences through methods such as direct preference optimization (DPO). While empirically successful, these approaches face well-known limitations, including length bias, reward hacking, binary preference assumptions, and the aggregation of heterogeneous preferences into a single scalar signal. In this work, we take an inverse perspective: rather than attempting to resolve these issues, we investigate an idealized setting, which we call the *pure semantic preference scenario*, in which such confounding factors are absent. We show that even in this idealized setting, existing alignment methods still fail to fully capture the underlying preference. Our analysis further reveals that (i) on-policy algorithms align more effectively, (ii) models trained without an explicit reference model perform better, and (iii) preference-model-based approaches consistently outperform reward-model-based approaches. Motivated by these observations, we introduce *preference matching optimization* (PMO), a DPO-type method that admits a closed-form solution and provably better approximates the true preference distribution. Experiments in both practical and idealized settings demonstrate that PMO performs comparably to existing alignment methods in the practical setting, while offering stronger theoretical grounding and better performance in the pure semantic setting.
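For context on the baseline the abstract contrasts against, the sketch below shows the standard DPO objective (not the paper's PMO method, whose closed-form solution is given in the full text). It is a minimal illustration assuming per-sequence log-probabilities are already computed; the function and argument names are illustrative, not from the submission.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss on a batch of (chosen, rejected) response pairs.

    Each argument is a tensor of sequence-level log-probabilities
    log pi(y | x), summed over tokens; `beta` scales the implicit reward.
    """
    # Implicit rewards are log-ratios against the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style objective: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that the frozen reference model enters only through the log-ratio terms; a reference-free objective, the kind implicated by finding (ii) above, would drop those terms and score the policy log-probabilities directly.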
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22896