Paraphrase Identification Datasets: Usage Survey and Generalization Patterns

ACL ARR 2024 June Submission1819 Authors

15 Jun 2024 (modified: 18 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: We conduct a survey to identify the most commonly used paraphrase identification datasets. We then examine the top three English datasets containing sentential paraphrases in more depth, comparing their qualitative and quantitative characteristics. In addition, we investigate the generalization performance of modern models trained on these datasets and find that they do not generalize well across datasets, which indicates a weakness in real-world generalization ability. Lastly, we test several methods for improving generalization and find that MNLI pre-training and improved label consistency are useful.
Paper Type: Long
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: paraphrase, survey, semantics
Contribution Types: NLP engineering experiment, Data analysis, Surveys
Languages Studied: English
Submission Number: 1819