Listwise Generalized Preference Optimization with Process-aware Signals for LLM Reasoning

ICLR 2026 Conference Submission 25060 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: RL Optimization, listwise ranking
Abstract: Standard preference optimization methods for LLMs suffer from two limitations: pairwise objectives like DPO discard valuable ranking information, and outcome-only supervision provides sparse feedback for multi-step reasoning. We propose Listwise Generalized Preference Optimization with Process-Aware signals (LGPO-PA), which combines listwise ranking objectives with dense process-level supervision. Our method scores multiple candidate responses using step-level process rewards, execution feedback, and consistency checks, then optimizes a convex listwise loss. Across mathematical reasoning (GSM8K, MATH), code generation (HumanEval, MBPP), and multi-hop QA (HotpotQA), LGPO-PA outperforms pairwise methods by 8-12\% and listwise methods without process signals by 6-9\%, while maintaining full offline operation. Ablations confirm that listwise optimization (+4.2\%) and process-aware scoring (+5.1\%) provide complementary benefits.
Primary Area: reinforcement learning
Submission Number: 25060