LLM-RankFusion: Mitigating Intrinsic Inconsistency in LLM-based Ranking

TMLR Paper6790 Authors

03 Dec 2025 (modified: 06 Mar 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: Ranking passages by prompting a large language model (LLM) can achieve promising performance in modern information retrieval (IR) systems. A common approach to sorting the ranking list is to prompt LLMs for pairwise or setwise comparisons, which are then fed into sorting algorithms. However, sorting-based methods require consistent comparisons to order the passages correctly, a requirement we show LLMs often violate. We identify two kinds of intrinsic inconsistency in LLM-based pairwise comparisons: order inconsistency, which yields conflicting results when the passage order is switched, and transitive inconsistency, which produces non-transitive triads among preference pairs. Studying these inconsistencies is relevant to understanding and improving the stability of any ranking scheme based on relative preferences. In this paper, we propose LLM-RankFusion, an LLM-based ranking framework that mitigates these inconsistencies and produces a robust ranking list. LLM-RankFusion mitigates order inconsistency through in-context learning (ICL), which demonstrates order-agnostic comparisons, and through calibration, which estimates the underlying preference probability between two passages. We then address transitive inconsistency by aggregating the ranking results from multiple rankers. In our experiments, we empirically show that LLM-RankFusion significantly reduces inconsistent comparisons and improves ranking quality by making the final ranking list more robust.
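The two inconsistencies the abstract defines can be checked mechanically over a set of pairwise preferences. A minimal sketch (the function names and the toy preference oracles are illustrative, not the paper's implementation): order inconsistency is a winner flip when the presentation order is swapped, and a transitive inconsistency is a triad in which each passage beats exactly one of the other two, forming a cycle.

```python
from itertools import combinations

def order_inconsistent(prefer, a, b):
    """True when swapping the presentation order flips the winner.
    `prefer(x, y)` returns the preferred passage when x is listed first."""
    return prefer(a, b) != prefer(b, a)

def nontransitive_triads(beats, passages):
    """Return triads (a, b, c) whose pairwise preferences form a cycle.
    `beats(x, y)` is an order-resolved preference: True iff x > y.
    In a 3-passage tournament, a cycle means each passage wins exactly once."""
    bad = []
    for a, b, c in combinations(passages, 3):
        wins = {x: 0 for x in (a, b, c)}
        for x, y in ((a, b), (b, c), (a, c)):
            winner = x if beats(x, y) else y
            wins[winner] += 1
        if max(wins.values()) == 1:  # every passage wins once -> non-transitive
            bad.append((a, b, c))
    return bad
```

For example, a rock-paper-scissors-style preference yields one non-transitive triad, while any preference induced by a total order yields none.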
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

### Methodological & Metric Revisions
* **Major Calibration Revision (Sec 4.1.2):** Replaced the nested softmax approach with a **log-odds framework** grounded in the Bradley-Terry model. This formulation uses continuous log-probabilities and the sigmoid function to mathematically cancel positional bias.
* **Updated Discrepancy Metric (Table 1):** Updated the discrepancy formula to align with the new sigmoid-based calibration. All values in Table 1 were recalculated (e.g., Llama-3-70B adjusted from 0.92 to 0.46) and the caption was clarified.
* **Formal Metric Definitions (Sec 5.1):** Added formal equations for DCG@k and NDCG@k, and un-commented the formal definition of Kendall-Tau distance.

### New Experiments & Analysis
* **New Ranking Stability Experiment:** Added a new table (`model-agg-kt`) reporting Kendall-Tau variance across 10 random initial orders. The new analysis demonstrates how aggregation methods (such as MC4) mitigate the instability caused by transitive inconsistencies.
* **New Related Work:** Added a paragraph discussing "Active Ranking and Comparison Oracles" (Tang et al., 2023; Chen et al., 2025) to contrast online/bandit strategies with our batch aggregation approach.
* **Expanded Cost Discussion (Sec 5.4):** Clarified that while token cost increases linearly, parallelization keeps latency comparable to single rankers. Noted that calibration adds negligible overhead since it operates on existing logits.

### Textual & Formatting Improvements
* **Refined Contribution Statement:** Revised the introduction to specifically claim novelty in *quantifying* the cascading effects of transitive inconsistency via hard-list stress tests and variance metrics.
* **Table Formatting:** Improved structure and captions for the *Aggregation*, *Cross-Model*, and *Baselines* tables (added headers, separators, and more precise performance claims).
* **Citation Style:** Updated all citations to `\citep{...}` format.
* **Phrasing Polish:** Minor stylistic edits for flow and grammar (e.g., changing "variant ranked lists" to "significantly varying ranked lists").
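The revision notes describe a log-odds calibration grounded in the Bradley-Terry model that cancels positional bias via the sigmoid of averaged log-probabilities. A minimal sketch of that idea, under the assumption that positional bias enters the log-odds additively (function and parameter names are illustrative, not the paper's exact formulation):

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p) - math.log(1.0 - p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def calibrated_pref(p_i_first, p_i_second):
    """Calibrated P(passage i beats passage j), Bradley-Terry style.

    p_i_first:  model probability that i wins when i is listed first
    p_i_second: model probability that i wins when i is listed second

    If logit(p_i_first) = s + b and logit(p_i_second) = s - b for a
    true preference score s and an additive positional bias b, then
    averaging the two log-odds cancels b and recovers sigmoid(s).
    """
    return sigmoid(0.5 * (logit(p_i_first) + logit(p_i_second)))
```

Under this additive-bias assumption the recovery is exact: feeding in `sigmoid(s + b)` and `sigmoid(s - b)` returns `sigmoid(s)` regardless of `b`, and symmetric inputs (e.g., 0.5 in both orders) calibrate to 0.5.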
Assigned Action Editor: ~Shiyu_Chang2
Submission Number: 6790