Rescaling Loss based on Importance of Machine-written Sentence

Anonymous

08 Mar 2022 (modified: 05 May 2023) · NAACL 2022 Conference Blind Submission · Readers: Everyone
Paper Link: https://openreview.net/forum?id=ZtrkqD5N_39
Paper Type: Short paper (up to four pages of content + unlimited references and appendices)
Abstract: Semantically meaningful sentence embeddings are important for numerous tasks in natural language processing. To obtain such embeddings, recent studies have explored the idea of learning sentence embeddings from synthetic data generated by pretrained language models (PLMs). However, PLMs often generate unrealistic sentences (i.e., sentences that differ from human-written ones). We hypothesize that training a model on these unrealistic sentences can have an adverse effect on learning semantically meaningful embeddings. To analyze this, we first train a classification model that identifies unrealistic sentences and observe that the linguistic features of the sentences predicted as unrealistic differ markedly from those of human-written sentences. Based on this, we propose a new method that first uses the classifier to measure the importance of each sentence. The information distilled from the classifier is then used to train a reliable sentence embedding model. Through extensive evaluation on four diverse datasets, we demonstrate that our model trained on synthetic data generalizes well and outperforms the baselines.
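The core idea of the abstract, down-weighting the training loss of sentences a classifier judges unrealistic, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rescale_losses` and the exact weighting scheme (weighting each per-sentence loss by the classifier's probability that the sentence is realistic, then normalizing by the total weight) are assumptions for the sake of the example.

```python
# Minimal sketch (not the paper's actual code): rescale per-sentence
# losses by an importance weight derived from a classifier's
# probability that each sentence is realistic (human-like).
# All names and the weighting scheme are illustrative assumptions.

def rescale_losses(losses, realism_probs):
    """Weight each sentence's loss by P(realistic) and average.

    losses        -- per-sentence loss values
    realism_probs -- classifier P(realistic) per sentence, in [0, 1]
    """
    assert len(losses) == len(realism_probs)
    weighted = [l * p for l, p in zip(losses, realism_probs)]
    # Normalize by the total weight so unrealistic sentences are
    # down-weighted relative to others, rather than merely shrinking
    # the overall loss scale.
    total_weight = sum(realism_probs)
    return sum(weighted) / total_weight if total_weight > 0 else 0.0

# Example: the second sentence is judged unrealistic (p = 0.1),
# so its large loss contributes little to the batch objective.
batch_losses = [0.5, 2.0, 0.8]
probs = [0.9, 0.1, 0.8]
batch_loss = rescale_losses(batch_losses, probs)
```

In practice the per-sentence losses would come from a sentence-embedding objective (e.g., a contrastive loss) and the probabilities from the trained unrealistic-sentence classifier described in the abstract.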