Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation

ACL ARR 2025 May Submission 6077 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: We tested over 20 Transformer models for ranking long documents (including recent LongP models trained with FlashAttention and RankGPT models "powered" by OpenAI and Anthropic cloud APIs). We compared them with a simple FirstP baseline, which applies the same model to the input truncated to at most 512 tokens. On MS MARCO, TREC DL, and Robust04, no long-document model outperformed FirstP by more than 5% on average. We hypothesized that this lack of improvement is not due to inherent limitations of long-context models, but to a positional bias in the benchmarks: most relevant passages tend to occur early in documents. To confirm this, we analyzed positional distributions of relevance across five corpora and six query sets and observed the same early-position bias. We then introduced a new diagnostic dataset, MS MARCO FarRelevant, in which relevant spans are deliberately placed beyond the first 512 tokens. On this dataset, many long-context models, including RankGPT, failed to generalize and performed near the random baseline, suggesting overfitting to positional bias. Finally, we experimented with de-biasing the training data, but this approach had only mixed success. Our findings (1) highlight the need for careful benchmark design when evaluating long-context models for document ranking, (2) identify model types that are more robust to positional bias, and (3) motivate further work on de-biasing training data. We release our code and data to support further research.
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: neural ranking
Contribution Types: Model analysis & interpretability, Reproduction study, Data resources
Languages Studied: English
Submission Number: 6077