In-context Ranking Preference Optimization

Junda Wu; Rohan Surana; Zhouhang Xie; Yiran Shen; Yu Xia; Tong Yu; Ryan A. Rossi; Prithviraj Ammanabrolu; Julian McAuley

In-context Ranking Preference Optimization

Junda Wu, Rohan Surana, Zhouhang Xie, Yiran Shen, Yu Xia, Tong Yu, Ryan A. Rossi, Prithviraj Ammanabrolu, Julian McAuley

Published: 08 Jul 2025, Last Modified: 26 Aug 2025COLM 2025EveryoneRevisionsBibTeXCC BY-NC-SA 4.0

Keywords: direct preference optimization, large language model

TL;DR: We introduce IRPO, a framework that optimizes LLMs using natural, in-context ranking feedback to enhance ranking quality while reducing computational cost.

Abstract: Recent developments in Direct Preference Optimization (DPO) allow large language models (LLMs) to function as implicit ranking models by maximizing the margin between preferred and non-preferred responses. In practice, user feedback on such lists typically involves identifying a few relevant items in context rather than providing detailed pairwise comparisons for every possible item pair. Besides, many complex information retrieval tasks, such as conversational agents and summarization systems, critically depend on ranking the highest-quality outputs at the top, further emphasizing the need to support natural and flexible forms of user feedback. To address the challenge of limited and sparse pairwise feedback in the in-context setting, we propose an In-context Ranking Preference Optimization (IRPO) framework that directly optimizes LLMs based on ranking lists constructed during inference. To further capture the natural and flexible forms of feedback, IRPO extends the DPO objective by incorporating both the relevance of items and their positions in the list. Modeling these aspects jointly is non-trivial, as ranking metrics are inherently discrete and non-differentiable, making direct optimization challenging. To overcome this, IRPO introduces a differentiable objective based on positional aggregation of pairwise item preferences, enabling effective gradient-based optimization of discrete ranking metrics. We further provide theoretical insights showing that IRPO (i) automatically emphasizes items with greater disagreement between the model and the reference ranking, and (ii) shows its gradient's linkage to an importance sampling estimator, resulting in an unbiased gradient estimator with reduced variance. Empirical evaluations demonstrate that IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness and efficiency in aligning LLMs with direct in-context ranking preferences.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 1164

Loading