CoPO: Contrastive Preference Optimization via On-Policy Reward Trajectory Alignment

ACL ARR 2026 May Submission16893 Authors

26 May 2026 (modified: 02 Jun 2026) | ACL ARR 2026 May Submission | Readers: Everyone | License: CC BY 4.0

Keywords: preference optimization, implicit rewards, contrastive learning

Abstract: Preference optimization has become a standard paradigm for aligning large language models (LLMs) with human preferences. Existing fine-grained preference optimization methods typically improve preference signal utilization beyond sequence-level objectives by introducing token-aware or trajectory-level supervision. However, these methods optimize preference margins over observed responses, whereas autoregressive generation depends on decoding trajectories. This optimization mismatch causes supervision to gradually narrow the effective preference regions, leading to preference collapse. To address this issue, we propose Contrastive Preference Optimization (CoPO), a preference optimization framework that aligns preference supervision with generation behavior through reward trajectory alignment. Specifically, CoPO introduces auxiliary anchor responses sampled from the current policy and contrastively aligns their token-level implicit reward trajectories toward those of preferred responses while separating them from those of rejected ones. In doing so, CoPO expands the coverage of preference-consistent reward regions. Experiments on seven benchmarks demonstrate that CoPO consistently improves preference alignment across different LLM backbones and multi-backbone preference data.
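The abstract does not state the exact objective, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes the DPO-style token-level implicit reward, beta * (log pi(y_t) - log pi_ref(y_t)), accumulated along the sequence, and an InfoNCE-style loss over mean-pooled trajectories. The function names, the mean pooling, the squared-distance similarity, and the parameters `beta` and `tau` are all hypothetical choices made for this example.

```python
import math


def implicit_reward_trajectory(policy_logps, ref_logps, beta=0.1):
    """Cumulative token-level implicit reward along one response.

    Assumption: the per-token reward is beta * (log pi - log pi_ref),
    as in DPO-style implicit rewards; the abstract gives no formula.
    """
    trajectory, total = [], 0.0
    for lp, lr in zip(policy_logps, ref_logps):
        total += beta * (lp - lr)
        trajectory.append(total)
    return trajectory


def _pool(trajectory):
    # Hypothetical choice: mean pooling, so that trajectories of
    # different lengths can be compared with a single scalar.
    return sum(trajectory) / len(trajectory)


def contrastive_alignment_loss(anchor, chosen, rejected, tau=1.0):
    """InfoNCE-style loss: pull the anchor's pooled reward trajectory
    toward the chosen response and push it away from the rejected one.

    Similarity is the negative squared distance between pooled
    trajectories (an assumption made for this sketch).
    """
    a, c, r = _pool(anchor), _pool(chosen), _pool(rejected)
    s_pos = -((a - c) ** 2) / tau
    s_neg = -((a - r) ** 2) / tau
    # Numerically stable log-sum-exp over the two similarities.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos  # equals -log softmax(s_pos)
```

In this sketch the loss equals log 2 when the anchor is equidistant from the chosen and rejected trajectories, and falls below log 2 as the anchor moves toward the chosen one. In an actual training setup the anchor would be sampled from the current policy at each step, and the log-probabilities would come from the policy and a frozen reference model.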

Paper Type: Long

Research Area: Machine Learning for NLP

Research Area Keywords: optimization methods

Contribution Types: NLP engineering experiment

Languages Studied: English

EMNLP 2026 AI Reviewing Experiment: no

Submission Number: 16893
