On Preference Optimization in Large Language Models Under Pure Semantic Preferences

TMLR Paper 6936 Authors

09 Jan 2026 (modified: 16 Jan 2026) | Under review for TMLR | CC BY 4.0
Abstract: Large language models (LLMs) are typically aligned with human preferences through methods such as direct preference optimization (DPO). While empirically successful, these approaches face well-known limitations, including length bias, reward hacking, binary preference assumptions, and the aggregation of heterogeneous preferences into a single scalar signal. In this work, we take an inverse perspective: rather than attempting to resolve these issues directly, we investigate an idealized setting, which we call the pure semantic preference scenario, in which such confounding factors are absent. To formalize this setting, we decompose the log-likelihood preference gap between two semantically equivalent generations into three additive components: a length alignment gap, a syntactic alignment gap, and a semantic alignment gap. We then study the regime in which the length and syntactic gaps are controlled to be zero, so that observed preferences reflect semantics alone. We show that even in this idealized setting, existing alignment methods do not fully capture the preference. Our analysis further reveals that (i) on-policy algorithms align more effectively, (ii) models trained without an explicit reference model perform better, and (iii) preference-model-based approaches consistently outperform reward-model-based approaches. Finally, motivated by these observations, we propose a lightweight preference-matching optimization (PMO) with a closed-form optimum that is well suited to the pure semantic setting. Experiments in both practical and idealized settings show that PMO performs comparably to standard alignment baselines in the practical setting, while offering a clearer theoretical interpretation and improved results in the pure semantic setting.
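As a minimal sketch of the decomposition described in the abstract (the notation below is illustrative and not taken from the paper), the log-likelihood preference gap for a prompt x and two semantically equivalent generations y+ and y- could be written as

% Hedged sketch; \Delta_{\mathrm{len}}, \Delta_{\mathrm{syn}}, \Delta_{\mathrm{sem}} are assumed symbols.
\[
\Delta(x) \;=\; \log \pi_\theta(y^{+}\mid x) \;-\; \log \pi_\theta(y^{-}\mid x)
\;=\; \Delta_{\mathrm{len}}(x) \;+\; \Delta_{\mathrm{syn}}(x) \;+\; \Delta_{\mathrm{sem}}(x),
\]

where the three terms are the length, syntactic, and semantic alignment gaps. Under this reading, the pure semantic preference scenario corresponds to the regime \(\Delta_{\mathrm{len}}(x) = \Delta_{\mathrm{syn}}(x) = 0\), so that the observed gap reduces to \(\Delta(x) = \Delta_{\mathrm{sem}}(x)\).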
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Weitong_ZHANG1
Submission Number: 6936