Abstract: Negative sample selection has been shown to have a crucial effect on the training of dense retrieval systems. Nevertheless, most existing negative selection methods ultimately choose randomly from some pool of samples, which calls for a better sampling solution. We define desired requirements for negative sample selection: the samples chosen should be informative, to advance the learning process, and diverse, to help the model generalize. We compose a sampling method designed to meet these requirements, and show that using it to enhance the training procedure of a recent prominent dense retrieval solution (coCondenser) improves the resulting model's performance. Specifically, we observe a \(\sim 2\%\) improvement in MRR@10 on the MS MARCO dataset (from 38.2 to 38.8) and a \(\sim 1.5\%\) improvement in Recall@5 on the Natural Questions dataset (from \(71\%\) to \(72.1\%\)), both statistically significant. Our solution, unlike other methods, does not require training or running inference with a large model, and adds only a small overhead (\(\sim 1\%\) additional time) to the training procedure. Finally, we report ablation studies showing that the defined objectives are indeed important when selecting negative samples for dense retrieval.