Noisy Self-Training with Synthetic Queries for Dense Retrieval

Fan Jiang; Tom Drummond; Trevor Cohn

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Information Retrieval and Text Mining

Keywords: dense retrieval, self training, synthetic queries

TL;DR: A noisy self-training method that uses synthetic queries to enhance dense retrievers on both general-domain and out-of-domain retrieval benchmarks.

Abstract: Although existing neural retrieval models show promising results when training data is abundant, and performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolving manner without relying on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Further analysis in low-resource settings reveals that our method is data-efficient and outperforms competitive baselines with as little as 30% of the labelled training data. Extending the framework to reranker training further demonstrates that the proposed method is general and yields additional gains on tasks from diverse domains. Source code is available at https://github.com/Fantabulous-J/Self-Training-DPR

Submission Number: 1254
