Document Expansion Baselines and Learned Sparse Lexical Representations for MS MARCO V1 and V2

Xueguang Ma, Ronak Pradeep, Rodrigo Frassetto Nogueira, Jimmy Lin

Published: 2022, Last Modified: 08 Jan 2026, SIGIR 2022, CC BY-SA 4.0

Abstract: With doc2query, we train a neural sequence-to-sequence model that, given an input span of text, predicts a natural language query that the text might answer. These predictions can be viewed as document expansions that feed standard bag-of-words term weighting models such as BM25 or neural retrieval models based on learned sparse lexical representations such as uniCOIL. Previous experiments on the MS MARCO datasets have demonstrated the effectiveness of these methods, and they serve as baselines that are widely used by the community today. Following the recent release of the MS MARCO V2 passage and document ranking test collections, we have refreshed our doc2query and uniCOIL models. This work describes a number of resources that support competitive, reproducible baselines for both the MS MARCO V1 and V2 test collections using our Anserini and Pyserini IR toolkits. Together, they provide a solid foundation for future research on neural retrieval models using the MS MARCO datasets and beyond.
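The document expansion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and uses neither Anserini nor Pyserini. The predicted queries are hard-coded, hypothetical stand-ins for the output of a trained doc2query model, the whitespace tokenizer is a simplification, and the BM25 function is a small textbook-style implementation with illustrative values for `k1` and `b`. The names `expand`, `bm25_scores`, and `predicted` are assumptions introduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def expand(doc, predicted_queries):
    # doc2query-style expansion: append the predicted queries to the text.
    return doc + " " + " ".join(predicted_queries)


def bm25_scores(query, docs, k1=0.9, b=0.4):
    # Textbook-style BM25 over a tiny in-memory collection.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the heart pumps blood through the body",
    "paris is the capital of france",
]

# Hypothetical model predictions, hard-coded for illustration only.
predicted = [
    ["what organ circulates blood"],
    ["what city is the french capital"],
]

expanded_docs = [expand(d, q) for d, q in zip(docs, predicted)]

query = "french city"
plain_scores = bm25_scores(query, docs)
expanded_scores = bm25_scores(query, expanded_docs)

print("without expansion:", plain_scores)
print("with expansion:   ", expanded_scores)
```

In this toy collection, the query shares no terms with the original text of the second passage, so BM25 cannot match it until the predicted query contributes the terms "french" and "city". This is the vocabulary-mismatch effect that expansion is meant to address. In practice, the expanded text would be indexed with a toolkit such as Anserini or Pyserini rather than scored in memory.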