Descriptive and Discriminative Document Identifiers for Generative Retrieval

Published: 01 Jan 2025, Last Modified: 23 Jul 2025, AAAI 2025, CC BY-SA 4.0
Abstract: Generative document retrieval is a novel retrieval framework that represents documents as identifiers (DocIDs) and retrieves documents by generating their DocIDs. It has the advantage of end-to-end optimization over traditional retrieval methods and has attracted much research interest. Nonetheless, designing efficient and precise DocIDs for document representation remains an open problem in the field. Existing methods for designing DocIDs tend to consider only the relevance of a DocID to its corresponding document, while neglecting the ability of the DocID to distinguish that document from similar ones, which is crucial for the retrieval task. In this paper, we design learnable descriptive and discriminative document identifiers (D2-DocID) for generative retrieval and propose the paired retrieval model D2Gen. A D2-DocID is semantically similar to its corresponding document (descriptive) and is able to distinguish it from similar documents in the corpus (discriminative), thus enhancing retrieval performance. We use a contrastive-learning-assisted generative retrieval task to enable the model to understand the document and then complete the generative retrieval. We then design a DocID selection method that selects DocIDs based on the retrieval model's understanding of the documents. Our experimental results on the MS MARCO and NQ320k datasets illustrate the effectiveness of the approach.
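To make the descriptive versus discriminative distinction concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss over toy DocID and document vectors. It is not the D2Gen objective, which the abstract does not specify; the function name `info_nce`, the temperature value, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(docid_vecs, doc_vecs, temperature=0.1):
    """Mean InfoNCE loss where DocID i should match document i and no other.

    Hypothetical helper for illustration; not taken from the paper.
    """
    ids = [normalize(v) for v in docid_vecs]
    docs = [normalize(v) for v in doc_vecs]
    total = 0.0
    for i, q in enumerate(ids):
        # Cosine similarity of this DocID to every document, scaled by temperature.
        logits = [dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all documents.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching document.
        total += log_denom - logits[i]
    return total / len(ids)


# Toy corpus (assumed values): documents 0 and 1 are near-duplicates.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Descriptive only: the two similar documents share one identifier vector.
shared_ids = [[0.95, 0.05], [0.95, 0.05], [0.0, 1.0]]

# Descriptive and discriminative: each identifier stays close to its own
# document but is pushed away from the near-duplicate.
distinct_ids = [[1.0, -0.3], [0.8, 0.4], [0.0, 1.0]]

loss_shared = info_nce(shared_ids, docs)
loss_distinct = info_nce(distinct_ids, docs)
print(loss_shared, loss_distinct)
```

In this toy setting the shared identifiers incur the higher loss, because an identifier that describes two near-duplicate documents equally well cannot tell them apart, which is the failure mode the abstract attributes to relevance-only DocID design.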