SimLTE: Simple Contrastive Learning for Long Text Embeddings

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission · Readers: Everyone
Abstract: This paper presents SimLTE, the first unsupervised pretraining method designed specifically for long text (e.g., documents, paragraphs). SimLTE uses the contrastive learning framework, and our main contribution is a simple but effective data augmentation technique for generating similar text pairs. Specifically, we pretrain a language model to distinguish whether two texts share the same topic, without any supervision or specific model architecture, so the method is widely applicable. Positive pairs are constructed under our key-information-redundancy assumption for long text. On standard classification datasets, SimLTE improves all baseline models, with an average gain of 3.9% macro-F1 score. We also consider a few-shot setting, where we show an average improvement of 12.0%.
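To make the abstract's recipe concrete, below is a minimal sketch of how such positive pairs and a contrastive objective could look in PyTorch. It assumes the key-information-redundancy idea amounts to treating two chunks sampled from the same long document as a positive pair, with chunks from other documents serving as in-batch negatives; the function names (`make_pairs`, `info_nce`), the chunking scheme, and the temperature value are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch of positive-pair construction and an InfoNCE loss for
# contrastive pretraining on long text. All names and hyperparameters
# here are hypothetical, not taken from the SimLTE paper.

import random
from typing import List, Tuple

import torch
import torch.nn.functional as F


def make_pairs(documents: List[str], chunk_size: int = 128) -> List[Tuple[str, str]]:
    """Sample two chunks per document; under the redundancy assumption,
    chunks from the same long document are treated as topically similar."""
    pairs = []
    for doc in documents:
        tokens = doc.split()
        if len(tokens) < 2 * chunk_size:
            continue  # too short to yield two distinct chunks
        starts = random.sample(range(len(tokens) - chunk_size + 1), 2)
        a = " ".join(tokens[starts[0] : starts[0] + chunk_size])
        b = " ".join(tokens[starts[1] : starts[1] + chunk_size])
        pairs.append((a, b))
    return pairs


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Standard in-batch InfoNCE: row i of z1 is positive for row i of z2,
    and every other row in the batch is a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature       # (B, B) cosine-similarity matrix
    labels = torch.arange(z1.size(0))      # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```

In this reading, the encoder (any language model producing one embedding per chunk) would be trained by encoding both chunks of each pair and minimizing `info_nce` over the batch; no labels or architecture changes are needed, which matches the abstract's claim of broad applicability.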
Paper Type: short
Research Area: Semantics: Sentence-level Semantics, Textual Inference and Other areas