Boosting Monolingual Sentence Representation with Large-scale Parallel Translation Datasets

26 May 2022 (modified: 05 May 2023) · ICML 2022 Pre-training Workshop
Keywords: pre-training, language model
Abstract: Although contrastive learning greatly improves sentence representations, its performance is still limited by the size of existing monolingual datasets. Can massive parallel translation pairs, whose sentences are highly correlated semantically, therefore be used to pre-train monolingual models? This paper explores that question. We leverage parallel translated sentence pairs to learn monolingual sentence embeddings and demonstrate superior performance in balancing alignment and uniformity. We achieve a new state of the art on the mean score across the standard Semantic Textual Similarity (STS) benchmarks, outperforming both SimCSE and Sentence-T5.
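The training signal the abstract describes, using a translation pair as the positive example for contrastive learning, is commonly implemented as an in-batch InfoNCE loss. The sketch below is a minimal NumPy illustration of that generic objective, not the authors' actual implementation; the function name and the temperature value are placeholders.

```python
import numpy as np

def info_nce_loss(src_emb, tgt_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over a batch of translation
    pairs: row i of src_emb and row i of tgt_emb are a positive pair,
    and all other targets in the batch serve as negatives.
    A hedged sketch of the standard objective, not the paper's code."""
    # L2-normalise so the dot product below is cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T / temperature  # (batch, batch) similarity matrix
    # Row-wise log-softmax; the diagonal entries are the positive pairs.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimising this loss pulls each sentence toward its translation (alignment) while the shared softmax denominator pushes apart unrelated sentences in the batch, which encourages a uniform spread of the embeddings.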