Learning to Tokenize for Generative Retrieval

Published: 21 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 posterEveryoneRevisionsBibTeX
Keywords: Information Retrieval, Document Retrieval, Generative Retrieval
TL;DR: This paper proposes a document tokenization learning method based on discrete auto-encoding to improve generative retrieval.
Abstract: As a new paradigm in information retrieval, generative retrieval directly generates a ranked list of document identifiers (docids) for a given query using generative language models (LMs). How to assign each document a unique docid (denoted as document tokenization) is a critical problem, because it determines whether the generative retrieval model can precisely retrieve any document by simply decoding its docid. Most existing methods adopt rule-based tokenization, which is ad-hoc and does not generalize well. In contrast, in this paper we propose a novel document tokenization learning method, GenRet, which learns to encode the complete document semantics into docids. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. We develop a progressive training scheme to capture the autoregressive nature of docids and diverse clustering techniques to stabilize the training process. Based on the semantic-embedded docids of any set of documents, the generative retrieval model can learn to generate the most relevant docid only according to the docids' semantic relevance to the queries. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets. GenRet establishes the new state-of-the-art on the NQ320K dataset. Compared to generative retrieval baselines, GenRet can achieve significant improvements on unseen documents. Moreover, GenRet can also outperform comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.
Submission Number: 9101