Abstract: Preference optimization methods such as DPO often yield aligned models that are overly deterministic, reducing output diversity and increasing the risk of mode collapse. This can limit downstream applications that benefit from multiple plausible outputs, such as reasoning and search. We propose Soft Preference Optimization (SPO), a reward-model-free algorithm that controls the entropy of the aligned model through a ``softness'' parameter. SPO minimizes a preference-based loss together with a global KL regularization term, which helps prevent unwanted distribution shifts outside the preference dataset. While the method does not rely on any reward-model assumption, we provide theoretical guarantees that, under a Bradley–Terry assumption, it converges to a softmax distribution over the expert rewards. We present the methodology, its theoretical analysis, and its comparative advantages in alignment precision and output diversity.
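As a rough illustration of how a softness-controlled preference loss combined with a global KL regularizer could be set up, the sketch below pairs a two-way softmax preference term with a Monte-Carlo KL estimate toward a reference model. This is an assumption-laden sketch, not the paper's exact objective: the function name `spo_style_loss`, the `softness` and `kl_coef` parameters, and the specific loss form are illustrative choices of ours.

```python
import torch
import torch.nn.functional as F

def spo_style_loss(logp_chosen, logp_rejected, logp_policy_samples,
                   logp_ref_samples, softness=1.0, kl_coef=0.1):
    """Illustrative SPO-style objective (not the paper's exact formulation).

    logp_chosen / logp_rejected: policy log-probs of the preferred and
        dispreferred completions from the preference dataset, shape (batch,).
    logp_policy_samples / logp_ref_samples: policy and reference-model
        log-probs of completions sampled from the policy, used for a
        Monte-Carlo estimate of a global KL term.
    softness: temperature-like parameter; larger values flatten the implied
        preference probabilities, keeping the aligned model higher-entropy.
    kl_coef: weight of the KL regularizer toward the reference model.
    """
    # Pairwise preference term: a softmax over the two completions' log-probs,
    # scaled by 1/softness so the parameter controls how peaked the target is.
    logits = torch.stack([logp_chosen, logp_rejected], dim=-1) / softness
    targets = torch.zeros(logits.shape[0], dtype=torch.long)  # index 0 = chosen
    pref_loss = F.cross_entropy(logits, targets)

    # Monte-Carlo estimate of KL(pi || pi_ref) on samples drawn from the policy,
    # regularizing the model globally rather than only on preference pairs.
    kl_est = (logp_policy_samples - logp_ref_samples).mean()

    return pref_loss + kl_coef * kl_est

# Toy usage with made-up log-probabilities for two preference pairs and
# three policy samples; all numbers are placeholders.
loss = spo_style_loss(
    logp_chosen=torch.tensor([-2.0, -1.5]),
    logp_rejected=torch.tensor([-2.5, -3.0]),
    logp_policy_samples=torch.tensor([-3.0, -2.0, -4.0]),
    logp_ref_samples=torch.tensor([-3.2, -2.1, -3.9]),
    softness=2.0, kl_coef=0.05)
```

In this sketch, increasing `softness` flattens the two-way softmax so the loss tolerates more probability mass on the dispreferred completion, which is the knob that would trade alignment sharpness against output entropy, while the KL term discourages drift from the reference model outside the preference data.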
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Weitong_ZHANG1
Submission Number: 6955