QCR: Quantised Codebooks for Retrieval

26 Sept 2024 (modified: 25 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: information retrieval, sparse retrieval, dense retrieval
TL;DR: This paper introduces Quantised Codebooks for Retrieval, a system that generates learned discrete representations and improves sparse retrieval performance by integrating these encodings into traditional inverted-index pipelines such as BM25.
Abstract: In recent years, the application of language models (LMs) to retrieval tasks has gained significant attention. Dense retrieval methods, which represent queries and document chunks as vectors, have gained popularity, but their use at scale can be challenging. These models can underperform traditional sparse approaches, like BM25, in some demanding settings, e.g. at web scale or out-of-domain. Moreover, the computational requirements, even with approximate nearest neighbour (ANN) indices, can be hefty. Sparse methods remain, thanks to their efficiency, ubiquitous in applications. In this work, we ask whether LMs can be leveraged to bridge this gap. We introduce Quantised Codebooks for Retrieval (QCR): we encode queries and documents as bags of latent discrete tokens, learned purely through a contrastive objective. QCR’s encodings can be used as a drop-in replacement for the original string in sparse retrieval indices, or can instead complement the text with higher-level semantic features. Experimental results demonstrate that QCR outperforms BM25 with vanilla text on the challenging MSMARCO dataset. Moreover, when used in conjunction with standard lexical matching, our representations yield an absolute 15.6% gain over BM25’s Success@100, highlighting the complementary nature of textual and learned discrete features.
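
Illustrative sketch (not part of the submission): the abstract describes feeding learned discrete codes into a standard inverted-index/BM25 pipeline, either replacing or complementing the lexical tokens. The snippet below is a minimal sketch of that integration under stated assumptions: qcr_codes is a hypothetical stand-in for the learned quantised encoder (a toy hash here, not a contrastively trained LM), and the BM25 class is a small self-contained Okapi scorer rather than the paper's implementation or a production inverted index.

import math
from collections import Counter

def qcr_codes(text):
    # Stand-in for the learned quantised encoder: returns a bag of discrete
    # code tokens such as "qcr_17". A real system would run an LM encoder
    # trained with a contrastive objective, followed by codebook lookup.
    return [f"qcr_{hash(w) % 64}" for w in text.lower().split()]

def tokenize(text, with_codes=True):
    # Lexical tokens, optionally complemented by QCR pseudo-terms so that
    # the discrete codes participate in ordinary term matching.
    toks = text.lower().split()
    if with_codes:
        toks = toks + qcr_codes(text)
    return toks

class BM25:
    # Minimal BM25 (Okapi) scorer over pre-tokenized documents.
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs, self.k1, self.b = docs, k1, b
        self.N = len(docs)
        self.avgdl = sum(len(d) for d in docs) / self.N
        self.df = Counter(t for d in docs for t in set(d))

    def score(self, query, doc):
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            denom = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            s += idf * tf[t] * (self.k1 + 1) / denom
        return s

corpus = ["sparse retrieval with inverted indices",
          "dense retrieval uses vector search",
          "BM25 is a classic lexical baseline"]
docs = [tokenize(d) for d in corpus]          # text tokens + QCR pseudo-terms
index = BM25(docs)
query = tokenize("lexical retrieval baseline")
ranking = sorted(range(len(docs)), key=lambda i: -index.score(query, docs[i]))
print(ranking)

Because the codes are just additional terms, switching between "codes only" and "text plus codes" is a matter of how tokenize is called, which mirrors the drop-in versus complementary usage described in the abstract.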
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7750