C2FDataset: A Comprehensive Parallel Dataset for Transitions between Persian Colloquial and Formal Variations

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission, Readers: Everyone

Abstract: Natural language processing models have shown promising results in analyzing Persian text. However, their performance drops significantly on colloquial Persian sentences. This drop stems from the substantial differences between colloquial and formal Persian and from the lack of parallel data for colloquial-to-formal Persian translation. To address this gap, we develop C2FDataset, a large-scale colloquial-to-formal Persian parallel dataset intended as a resource for training models that bridge the linguistic variation between the two registers. To illustrate the utility of the dataset, we trained a GPT-2 model on it. The resulting model outperformed both ChatGPT and a leading rule-based system in colloquial-to-formal text style transfer. This conclusion is supported by an extensive human evaluation conducted with a ranking-based scoring system that we designed for this purpose. The results underscore the value of C2FDataset for improving the performance of natural language processing models on colloquial-to-formal Persian conversion.

Paper Type: short

Research Area: Multilinguality and Language Diversity

Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: Persian
