Dynamic-anchored Preference Optimization for Human-Like Moral Alignment

18 Sept 2025 (modified: 02 Feb 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Moral Foundation, LLMs, Preference
Abstract: Preference optimization is widely used to align large language models (LLMs) with human values. Direct Preference Optimization (DPO) offers a simple, reward-model-free solution, but it relies on static binary preference pairs and a fixed reference policy, which limits its ability to capture multi-dimensional moral signals and makes it sensitive to conflicting prompts. To address these limitations, we propose Dynamic-anchored Preference Optimization (DAPO), an extension of DPO that incorporates moral preference reconstruction and adaptive-weighted optimization. DAPO introduces (1) a dynamic-anchored triplet construction mechanism grounded in Moral Foundations Theory (MFT), which enables exploration of both benevolence reinforcement and malevolence suppression, and (2) a value-guided pairwise loss with heuristic adaptive weighting that balances training signals while reducing reliance on a fixed reference policy. Experiments on benchmarks covering emotional understanding, moral reasoning, and factual consistency show that DAPO consistently improves accuracy and robustness compared to DPO-based methods. Further sensitivity analyses indicate that DAPO is a practical extension of DPO that makes preference optimization more reliable and effective for moral alignment.
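For context, the baseline that the abstract extends can be sketched in a few lines. The snippet below is a minimal illustration of the standard per-pair DPO loss only; it is not the DAPO objective, whose triplet construction and adaptive weighting are not specified in the abstract. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under the fixed reference policy. All names here are illustrative.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference policy.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When the policy equals the reference, the margin is zero and the loss
# equals log(2); it falls as the policy favors the chosen response more.
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
improved = dpo_loss(-0.5, -2.5, -1.0, -2.0)
```

The fixed reference log-probabilities and the single binary pair per prompt visible in this sketch are the two properties of DPO that the abstract identifies as limitations.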
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10904