Decontextualization, Everywhere: A Systematic Audit on PeerQA

Published: 08 Oct 2025, Last Modified: 20 Oct 2025
Venue: Agents4Science
License: CC BY 4.0
Keywords: scientific QA, dense retrieval, decontextualization
TL;DR: A systematic audit of decontextualization strategies for scientific QA on PeerQA: oracle per-paper indexing dramatically inflates retrieval scores, answerability remains robust under full-corpus search, and the best decontextualization choice depends on the retriever and evaluation setting.
Abstract: We audit decontextualization strategies for long-document scientific QA on PeerQA. We sweep sentence- and paragraph-level templates (from minimal content to title+heading) across BM25, TF–IDF, dense retrieval, ColBERT, and cross-encoder reranking, and evaluate with Recall@k, MRR, and answerability F1. A central finding is that oracle-style evaluation (per-paper indexing) dramatically inflates retrieval scores compared to full-corpus search: BM25 achieves R@10=1.000 and MRR$\approx$0.68 under oracle, but only R@10$\approx$0.011 and MRR$\approx$0.015 over the full corpus. Surprisingly, answerability remains robust, with full-corpus configurations matching or exceeding oracle F1. We further show that decontextualization is not one-size-fits-all: sparse methods favor minimal context in oracle settings, while paragraph-level chunks with measured structure (title+heading) work best under realistic full-corpus conditions, and late-interaction models benefit from more aggressive context. We release a configurable framework and provide practical guidance: prioritize paper identification before fine-grained evidence search, prefer paragraph-level chunks, use measured decontextualization, and evaluate end-to-end under full-corpus conditions.
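To make the setup concrete, the following is a minimal Python sketch of two ingredients the abstract describes: a decontextualization template that optionally prepends the paper title and section heading to a chunk before indexing, and the Recall@k / MRR metrics used for evaluation. The `Chunk` dataclass, the template names, and the function signatures are illustrative assumptions, not the released framework's API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    paper_id: str
    text: str
    title: str = ""
    heading: str = ""

def decontextualize(chunk: Chunk, template: str) -> str:
    """Prepend document context to a chunk before indexing.

    Hypothetical template names: 'minimal' keeps the raw text,
    'title' prepends the paper title, and 'title+heading' prepends
    both the title and the enclosing section heading.
    """
    if template == "minimal":
        return chunk.text
    if template == "title":
        return f"{chunk.title}. {chunk.text}"
    if template == "title+heading":
        return f"{chunk.title}. {chunk.heading}. {chunk.text}"
    raise ValueError(f"unknown template: {template}")

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of relevant chunks that appear in the top-k ranking."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant_ids)
    return hits / len(relevant_ids)

def mrr(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0 if none is retrieved)."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: gold chunk retrieved at rank 3 for one question
# ranked = ["p1_c3", "p2_c1", "p1_c7"]; gold = {"p1_c7"}
# recall_at_k(ranked, gold, k=10) -> 1.0 ; mrr(ranked, gold) -> 1/3
```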
Supplementary Material: zip
Submission Number: 300