Influence Guided Sampling for Domain Adaptation of Text Retrievers

ICLR 2026 Conference Submission24808 Authors

20 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: retrieval, domain adaptation, dataset sampling
TL;DR: We address domain adaptation in text retrievers via data reweighting and propose an efficient influence-based sampling strategy.
Abstract: General-purpose open-domain dense retrieval systems must usually be trained on a large, eclectic mix of corpora and search tasks. How should these diverse corpora and tasks be sampled for training? Conventional approaches sample them uniformly, proportionally to their instance population sizes, or according to human expert supervision. It is well known that the training data sampling strategy can greatly impact model performance; however, how to find the optimal strategy has not been adequately studied in the context of embedding models. We propose Inf-DDS, a novel reinforcement learning–driven sampling framework that adaptively reweights training datasets guided by influence‑based reward signals and is much more lightweight with respect to GPU consumption. Our technique iteratively refines the sampling policy, prioritizing sampling from datasets that maximize model performance on a target development set. We evaluate the efficacy of our sampling strategy on a wide range of text retrieval tasks, demonstrating strong improvements in retrieval performance and better adaptation compared to existing gradient-based sampling methods, while also being *1.5×–4×* cheaper than them in terms of GPU compute needed. Our sampling strategy achieves a **5.03** absolute *NDCG@10* improvement while training a multilingual *bge-m3-dense* model and an absolute *NDCG@10* improvement of **0.94** while training *sentence-transformers/all-MiniLM-L6-v2*, even when starting from expert-assigned weights on a large pool of training datasets.
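The abstract's core idea — iteratively reweighting datasets so that those with higher influence-based rewards are sampled more often — can be illustrated with a minimal sketch. The multiplicative-weights update rule, the function names, and the toy rewards below are assumptions for illustration only; the paper's actual Inf-DDS policy-update details are not given here.

```python
import math

def update_sampling_weights(weights, influence_rewards, lr=0.1):
    """Hypothetical sketch: exponentiated (multiplicative-weights) update.
    Datasets whose sampled batches show higher influence on the target
    dev-set performance receive larger sampling probabilities."""
    raw = [w * math.exp(lr * r) for w, r in zip(weights, influence_rewards)]
    z = sum(raw)  # renormalize so the weights remain a distribution
    return [w / z for w in raw]

# Toy usage: three datasets with constant (assumed) influence rewards;
# the policy shifts probability mass toward the highest-reward dataset.
w = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5):
    w = update_sampling_weights(w, [0.1, 0.5, -0.2])
print(w.index(max(w)))  # → 1 (the dataset with the highest reward)
```

A multiplicative update of this form keeps all weights positive and normalized, so the sampling policy stays a valid distribution at every iteration; the actual reward signal in the paper is influence-based rather than the fixed constants used here.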
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 24808