Abstract: A dense retriever learns text embeddings to fetch relevant documents from a database in response to queries. However, real-world document streams constantly evolve and often diverge from the retriever's original training distribution. Indexing these documents without preemptive measures (e.g., updating or retraining the retriever) can lead to retrieval failures for future test queries. It is therefore crucial to detect when a dense retriever needs updating before those test queries arrive, so that the retrieval system remains reliable. To address this challenge, we introduce a novel task: predicting whether a given corpus is out-of-domain (OOD) for a dense retriever before indexing it. This task lets us assess whether applying the current retriever to the corpus would create vulnerabilities for future test queries. We propose GradNormIR, a novel unsupervised method that leverages gradient norms to detect OOD documents within a given corpus. Experiments on the BEIR benchmark demonstrate that GradNormIR enables timely retriever updates on evolving corpora, providing practical guidance for building an efficient and robust retrieval system.
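The abstract only names the core idea (gradient norms as an OOD signal), so the snippet below is a minimal, hypothetical sketch of gradient-norm-based OOD scoring for a dense retriever. It assumes a HuggingFace bi-encoder backbone (`sentence-transformers/all-MiniLM-L6-v2`) and an ICT-style pseudo-query loss as a stand-in for the paper's actual objective; the function `gradient_norm_score`, the temperature, and the pooling choice are all illustrative, not the authors' implementation.

```python
# Hypothetical sketch: score each document by the gradient norm it induces
# under a self-supervised retrieval loss. A larger norm is read as a sign
# that the document is OOD for the current retriever (an assumption here).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed retriever backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Mean-pooled token embeddings (gradients are kept for backprop)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # [B, L, H]
    mask = batch["attention_mask"].unsqueeze(-1).float() # [B, L, 1]
    return (hidden * mask).sum(1) / mask.sum(1)          # [B, H]

def gradient_norm_score(doc, negative_docs, temperature=0.05):
    """Gradient norm of a pseudo contrastive loss for one document.

    The first half of the document serves as a pseudo-query and the second
    half as its positive passage (an ICT-style proxy, not the paper's loss).
    """
    model.zero_grad()
    words = doc.split()
    half = max(1, len(words) // 2)
    pseudo_query = " ".join(words[:half])
    positive = " ".join(words[half:]) or pseudo_query

    q = embed([pseudo_query])                      # [1, H]
    cands = embed([positive] + negative_docs)      # [1 + N, H]
    logits = (q @ cands.T) / temperature           # positive sits at index 0
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()

    squared = sum(p.grad.norm().item() ** 2 for p in model.parameters()
                  if p.grad is not None)
    return squared ** 0.5

# Usage sketch: rank documents of an incoming corpus by gradient norm and
# flag the corpus for retriever updating if many documents score highly.
# corpus = ["doc one ...", "doc two ...", "doc three ..."]
# scores = [gradient_norm_score(d, [x for x in corpus if x is not d]) for d in corpus]
```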
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: passage retrieval, dense retrieval
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 6944