Keywords: query rewriting, dense retrieval, RAG, LLM, vocabulary drift, terminology standardization, BEIR, domain adaptation
TL;DR: Single-step LLM query rewriting degrades dense retrieval by 9% on well-optimized domains through vocabulary drift and improves it by 5% on domains with inconsistent terminology; the harm persists across rewriter families, and simple gating cannot beat never-rewriting.
Abstract: Prompt-only, single-step LLM query rewriting (i.e., a single rewrite generated from the query alone, without retrieval feedback) is commonly deployed in production RAG pipelines, but its impact on dense retrieval is poorly understood. We conduct a systematic empirical study across three BEIR benchmarks, two dense retrievers, and multiple training configurations, and find strongly domain-dependent effects: rewriting degrades nDCG@10 by 9.0% on FiQA (p < 0.001), improves it by 5.1% on TREC-COVID (p = 0.024; n = 50, marginal after correction), and has no significant effect on SciFact (p = 0.47). We identify a consistent mechanism: degradation co-occurs with reduced lexical alignment between the query and the ground-truth relevant documents, measured by the change in VOR (∆VOR, p = 0.013), as rewriting substitutes domain-specific terms in queries that were already well matched. Improvement occurs when rewriting harmonizes inconsistent nomenclature toward corpus-preferred terminology, as captured by a Corpus Term Frequency ratio (CTF; 1153× for improved vs. 2.7× for degraded TREC-COVID queries, p < 0.001). Even with privileged post-hoc signals, simple feature-based gating (AUC = 0.593) cannot reliably improve over never-rewriting (p > 0.12), and oracle analysis reveals a ceiling of only +3 pp. These results caution that prompt-only rewriting can be harmful in well-optimized verticals, and they motivate post-training as a safer adaptation strategy.
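To make the lexical-alignment mechanism concrete, the following is a minimal sketch of one way to measure how a rewrite changes a query's token overlap with its relevant documents. The abstract does not define VOR, so this is an illustrative stand-in rather than the paper's metric: the tokenizer, the function names `overlap_ratio` and `overlap_delta`, and the example query are all assumptions.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens (an assumed, deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_ratio(query: str, relevant_docs: list[str]) -> float:
    """Fraction of distinct query tokens found in at least one relevant document."""
    q = tokens(query)
    if not q:
        return 0.0
    doc_vocab: set[str] = set().union(*(tokens(d) for d in relevant_docs))
    return len(q & doc_vocab) / len(q)


def overlap_delta(original: str, rewritten: str, relevant_docs: list[str]) -> float:
    """Change in overlap caused by the rewrite; negative means alignment was lost."""
    return overlap_ratio(rewritten, relevant_docs) - overlap_ratio(original, relevant_docs)


# Hypothetical example: the rewrite replaces domain terms that already matched.
docs = ["The Roth IRA contribution limit for 2019 is $6,000."]
original = "roth ira contribution limit"
rewritten = "retirement account deposit maximum"
delta = overlap_delta(original, rewritten, docs)
```

In this constructed case every token of the original query appears in the relevant document and none of the rewritten tokens do, so `delta` is negative. That is the kind of vocabulary drift the abstract associates with degraded retrieval.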
Submission Number: 122