Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval

Junhan Yang; Zheng Liu; Chaozhuo Li; Guangzhong Sun; Xing Xie

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Information Retrieval and Text Mining

Submission Track 2: NLP Applications

Keywords: dense retrieval, document retrieval

Abstract: Pre-trained language models (PLMs) have become the dominant approach in dense retrieval owing to their strong capacity for modeling intrinsic semantics. However, most existing PLM-based retrieval models incur substantial computational costs and are infeasible for processing long documents. In this paper, we propose Longtriever, a novel retrieval model that addresses three core challenges of long document retrieval: substantial computational cost, incomplete document understanding, and scarce annotations. Longtriever splits long documents into short blocks and then efficiently models the local semantics within each block and the global context semantics across blocks in a tightly coupled manner. A pre-training phase is further proposed to give Longtriever a better understanding of underlying semantic correlations. Experimental results on two popular benchmark datasets demonstrate the superiority of our proposal.
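
The block-splitting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_into_blocks` and the default `block_size` of 512 are assumptions chosen for illustration, and the abstract does not state the block length or tokenizer that Longtriever actually uses. The sketch covers only the splitting, not the local or global modeling.

```python
from typing import List


def split_into_blocks(token_ids: List[int], block_size: int = 512) -> List[List[int]]:
    """Split a long token sequence into consecutive blocks.

    Hypothetical helper: block_size is an assumed value, not one
    taken from the paper. The last block may be shorter than the rest.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Step through the sequence in strides of block_size.
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids), block_size)
    ]


# Example: a toy "document" of 5 token ids split into blocks of 2.
blocks = split_into_blocks([10, 11, 12, 13, 14], block_size=2)
```

Each resulting block would then be encoded locally, with some mechanism passing context across blocks; that cross-block modeling is the paper's contribution and is not reproduced here.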

Submission Number: 624
