Abstract: A dense retriever learns text embeddings to fetch relevant documents from a database in response to queries. However, real-world document streams constantly evolve and often diverge from the retriever's original training distribution. Indexing these documents without preemptive measures (e.g., updating or retraining the retriever) can lead to retrieval failures for future test queries. It is therefore crucial to detect when a dense retriever needs updating before those test queries arrive, so that the retrieval system remains reliable. To address this challenge, we introduce a novel task: predicting whether a given corpus is out-of-domain (OOD) for a dense retriever before indexing it. This task lets us assess whether applying the current retriever to the corpus would create vulnerabilities for future test queries. We propose GradNormIR, a novel unsupervised method that leverages gradient norms to detect OOD documents within a given corpus. Experiments on the BEIR benchmark demonstrate that GradNormIR enables timely retriever updates on evolving corpora, providing practical guidance for building an efficient and robust retrieval system.
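The abstract only names the core idea (gradient norms as an OOD signal), so the snippet below is a minimal, hypothetical sketch of gradient-norm-based OOD scoring for a dense retriever. It assumes a HuggingFace bi-encoder backbone (`sentence-transformers/all-MiniLM-L6-v2`) and an ICT-style pseudo-query loss as a stand-in for the paper's actual objective; the function `gradient_norm_score`, the temperature, and the pooling choice are all illustrative, not the authors' implementation.

```python
# Hypothetical sketch: score each document by the gradient norm it induces
# under a self-supervised retrieval loss. A larger norm is read as a sign
# that the document is OOD for the current retriever (an assumption here).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed retriever backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Mean-pooled token embeddings (gradients are kept for backprop)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # [B, L, H]
    mask = batch["attention_mask"].unsqueeze(-1).float() # [B, L, 1]
    return (hidden * mask).sum(1) / mask.sum(1)          # [B, H]

def gradient_norm_score(doc, negative_docs, temperature=0.05):
    """Gradient norm of a pseudo contrastive loss for one document.

    The first half of the document serves as a pseudo-query and the second
    half as its positive passage (an ICT-style proxy, not the paper's loss).
    """
    model.zero_grad()
    words = doc.split()
    half = max(1, len(words) // 2)
    pseudo_query = " ".join(words[:half])
    positive = " ".join(words[half:]) or pseudo_query

    q = embed([pseudo_query])                      # [1, H]
    cands = embed([positive] + negative_docs)      # [1 + N, H]
    logits = (q @ cands.T) / temperature           # positive sits at index 0
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()

    squared = sum(p.grad.norm().item() ** 2 for p in model.parameters()
                  if p.grad is not None)
    return squared ** 0.5

# Usage sketch: rank documents of an incoming corpus by gradient norm and
# flag the corpus for retriever updating if many documents score highly.
# corpus = ["doc one ...", "doc two ...", "doc three ..."]
# scores = [gradient_norm_score(d, [x for x in corpus if x is not d]) for d in corpus]
```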
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: passage retrieval, dense retrieval
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 6944