Learning Monolingual Sentence Embeddings with Large-scale Parallel Translation Datasets

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission · Readers: Everyone
Abstract: Although contrastive learning has greatly improved sentence representation, its performance is still limited by the size of monolingual sentence-pair datasets. Meanwhile, there exist large-scale parallel translation pairs (100x larger than monolingual pairs) that are highly correlated in semantics but have not been utilized for learning sentence representations. Furthermore, given parallel translation pairs, previous contrastive learning frameworks cannot properly balance the alignment and uniformity of monolingual embeddings, two properties that characterize embedding quality. In this paper, we build on top of a dual-encoder architecture and propose to freeze the source-language encoder, utilizing its consistent embeddings to supervise the target-language encoder via contrastive learning, where source-target translation pairs are regarded as positives. We provide the first exploration of utilizing parallel translation sentence pairs to learn monolingual sentence embeddings and show superior performance in balancing alignment and uniformity. We achieve a new state-of-the-art average score on the standard semantic textual similarity (STS) benchmarks, outperforming both SimCSE and Sentence-T5, and the best performance in the corresponding tracks on transfer tasks.
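
The abstract describes the core idea only at a high level, so the following is a minimal sketch of what such a frozen-source, translation-pair contrastive objective could look like. It assumes an InfoNCE-style loss with in-batch negatives, a temperature of 0.05, and hypothetical `source_encoder` / `target_encoder` modules that map a batch of sentences to fixed-size embeddings; none of these specifics are stated in the abstract.

```python
import torch
import torch.nn.functional as F

def frozen_source_contrastive_loss(source_encoder, target_encoder,
                                   src_batch, tgt_batch, temperature=0.05):
    """Sketch of a contrastive objective where a frozen source-language encoder
    supervises a trainable target-language encoder: each source-target
    translation pair is a positive, all other in-batch pairs are negatives."""
    with torch.no_grad():                       # source encoder stays frozen
        src_emb = source_encoder(src_batch)     # (batch, dim) anchor embeddings
    tgt_emb = target_encoder(tgt_batch)         # (batch, dim), receives gradients

    src_emb = F.normalize(src_emb, dim=-1)
    tgt_emb = F.normalize(tgt_emb, dim=-1)

    # Cosine similarity between every target embedding and every source embedding.
    logits = tgt_emb @ src_emb.t() / temperature                   # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)    # positives on the diagonal
    return F.cross_entropy(logits, labels)
```

In this reading, only the target-language encoder is updated, which is one plausible way to keep the source embeddings "consistent" as the abstract puts it; the paper itself should be consulted for the exact loss and training details.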
Paper Type: long