PLOT: Enhancing Preference Learning via Optimal Transport

ACL ARR 2025 February Submission6422 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Preference learning in large language models (LLMs) has primarily followed two approaches: (1) fine-tuning-based methods that optimize models using human preference signals and (2) inference-phase techniques that regulate outputs through decoding-time interventions. While these methods effectively mitigate harmful content generation, they remain vulnerable to adversarial jailbreak attacks and suffer from limitations such as high computational costs, sensitivity to hyperparameters, and insufficient consideration of global token-level relationships. This paper introduces **PLOT**, a method that enhances the **P**reference **L**earning capability of fine-tuning-based alignment techniques through a token-level loss term derived from **O**ptimal **T**ransport. By modeling preference learning as an **Optimal Transport Problem**, PLOT aligns model outputs with human preferences while preserving the model’s original distribution, thereby ensuring stability and robustness. Additionally, PLOT incorporates **token embeddings** to capture rich semantic relationships, enabling a more globally informed optimization process. Our experimental evaluations demonstrate that PLOT **significantly reduces attack success rates (ASR) across various red-teaming adversarial attacks** while maintaining general model performance. Compared to baseline fine-tuning methods, PLOT achieves **a reduction of up to 8.83% in ASR** while preserving fluency and coherence in general tasks. These results establish optimal transport as a principled and effective approach to preference learning, offering a robust framework for enhancing model alignment, safety, and adversarial robustness.
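The abstract describes a token-level loss derived from optimal transport over token embeddings. As a minimal sketch of what such a loss could look like, the snippet below computes an entropic-regularized (Sinkhorn) OT cost between two sets of token embeddings with a cosine-distance cost matrix. The function name, the cosine cost, uniform marginals, and all hyperparameters are illustrative assumptions, not PLOT's actual formulation.

```python
# Sketch of a token-level optimal-transport loss via Sinkhorn iterations.
# All design choices here (cosine cost, uniform marginals, reg=0.1) are
# assumptions for illustration, not the paper's definitions.
import numpy as np

def sinkhorn_ot_loss(src_emb, tgt_emb, reg=0.1, n_iters=100):
    """Entropic-regularized OT cost between two token-embedding sets.

    src_emb: (n, d) embeddings of the model's output tokens
    tgt_emb: (m, d) embeddings of the preferred response's tokens
    """
    # Cosine-distance cost matrix capturing pairwise token relationships.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    C = 1.0 - src @ tgt.T                      # shape (n, m)

    # Uniform marginals over the two token sequences.
    a = np.full(C.shape[0], 1.0 / C.shape[0])
    b = np.full(C.shape[1], 1.0 / C.shape[1])

    # Sinkhorn fixed-point iterations on the Gibbs kernel K = exp(-C/reg).
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)

    # Transport plan P and the resulting OT cost, used as the loss term.
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * C))
```

In a fine-tuning setting this scalar would be added to the base preference-learning objective; because the plan considers all pairwise token costs, the gradient reflects global token-level relationships rather than position-wise matches.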
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: fine-tuning, preference learning, security and privacy, red teaming
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Theory
Languages Studied: English
Submission Number: 6422