DocSplit: Simple Contrastive Pretraining for Large Document Embeddings

Yujie Wang; Mike Izbicki

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings

Submission Type: Regular Short Paper

Submission Track: Semantics: Lexical, Sentence level, Document Level, Textual Inference, etc.

Submission Track 2: Information Retrieval and Text Mining

Keywords: Natural Language Processing; Machine Learning; Text Embeddings

Abstract: Existing model pretraining methods only consider local information. For example, in the popular token masking strategy, the words closer to the masked token are more important for prediction than words far away. This results in pretrained models that generate high-quality sentence embeddings, but low-quality embeddings for large documents. We propose a new pretraining method called DocSplit, which forces models to consider the entire global context of a large document. Our method uses a contrastive loss where the positive examples are randomly sampled sections of the input document, and negative examples are randomly sampled sections of unrelated documents. Like previous pretraining methods, DocSplit is fully unsupervised, easy to implement, and can be used to pretrain any model architecture. Our experiments show that DocSplit outperforms other pretraining methods for document classification, few-shot learning, and information retrieval tasks.
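
The sampling scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss, so an InfoNCE-style softmax loss is assumed here. The anchor is assumed to be a second random section of the same document. The names sample_section, toy_embed, contrastive_loss, and docsplit_style_loss are hypothetical, and toy_embed is a hashed bag-of-words stand-in for whatever model is being pretrained.

```python
import math
import random
import zlib


def sample_section(tokens, length, rng):
    """Return a random contiguous span of `length` tokens (the whole doc if shorter)."""
    if len(tokens) <= length:
        return list(tokens)
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]


def toy_embed(tokens, dim=64):
    """Hypothetical stand-in encoder: L2-normalized hashed bag of words."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Assumed InfoNCE form: -log softmax of the positive among all candidates."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


def docsplit_style_loss(docs, doc_index, section_len, rng):
    """Positive: a section of the same document. Negatives: sections of the others."""
    doc = docs[doc_index]
    anchor = toy_embed(sample_section(doc, section_len, rng))
    positive = toy_embed(sample_section(doc, section_len, rng))
    negatives = [
        toy_embed(sample_section(other, section_len, rng))
        for i, other in enumerate(docs)
        if i != doc_index
    ]
    return contrastive_loss(anchor, positive, negatives)


if __name__ == "__main__":
    rng = random.Random(0)
    docs = [
        "the cat sat on the mat while the other cat slept".split(),
        "stock prices fell sharply after the earnings report".split(),
        "the team scored twice in the final minutes of the match".split(),
    ]
    print(docsplit_style_loss(docs, 0, 4, rng))
```

In an actual pretraining run, the stand-in encoder would be replaced by a trainable model and the loss averaged over a batch, with the other documents in the batch serving as negatives; those details are assumptions rather than statements from the abstract.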

Submission Number: 579
