Enhancing Sparse Retrieval via Unsupervised Learning

Published: 01 Jan 2023 · Last Modified: 07 Feb 2025 · SIGIR-AP 2023 · CC BY-SA 4.0
Abstract: Recent work has shown that neural retrieval models excel at text ranking tasks when given large amounts of manually labeled training data. However, it remains an open question how to train unsupervised retrieval models that are more effective than baselines such as BM25. While some progress has been made on unsupervised dense retrieval within a bi-encoder architecture, unsupervised sparse retrieval models remain unexplored. We propose BM26, to our knowledge the first such model, which is trained in an unsupervised manner without any human relevance judgments. Evaluations on multiple test collections show that BM26 performs on par with BM25 and outperforms Contriever, the current state-of-the-art unsupervised dense retriever. We further demonstrate two promising avenues for enhancing lexical retrieval: first, BM25 and BM26 can be combined via simple vector concatenation into an unsupervised hybrid model, BM51, which significantly improves over BM25 alone; second, supervised sparse models such as SPLADE can be initialized with BM26, yielding significant gains in both in-domain and zero-shot retrieval effectiveness.
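To make the "vector concatenation" idea concrete, here is a minimal, hypothetical Python sketch (not the authors' code): each document and query receives one sparse term-weight vector from BM25 and one from an unsupervised learned sparse model such as BM26, the two vectors are concatenated by giving their features disjoint names, and retrieval scores are plain sparse inner products. The function names (`concat_sparse`, `score`) and all weights below are illustrative assumptions; the actual BM26 weighting is not shown.

```python
# Hedged sketch of sparse-vector concatenation for hybrid lexical retrieval.
# The numeric weights are made up for illustration only.

def concat_sparse(bm25_vec: dict, learned_vec: dict) -> dict:
    """Concatenate two sparse term-weight vectors into one by prefixing
    feature names, keeping the two weight spaces disjoint."""
    hybrid = {f"bm25:{t}": w for t, w in bm25_vec.items()}
    hybrid.update({f"learned:{t}": w for t, w in learned_vec.items()})
    return hybrid

def score(query_vec: dict, doc_vec: dict) -> float:
    """Sparse dot product between a query vector and a document vector."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

# Toy example: one document and one query, each represented by both models.
doc_hybrid = concat_sparse(
    bm25_vec={"neural": 1.8, "retrieval": 2.1},
    learned_vec={"neural": 0.7, "ranking": 0.4},
)
query_hybrid = concat_sparse(
    bm25_vec={"retrieval": 1.0},
    learned_vec={"ranking": 1.0},
)
print(score(query_hybrid, doc_hybrid))  # 2.1 + 0.4 = 2.5
```

Because the feature namespaces are disjoint, the hybrid score is simply the sum of the BM25 score and the learned-sparse score, which is what concatenating the two sparse representations amounts to.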