C2FDataset: A Comprehensive Parallel Dataset for Transitions between Persian Colloquial and Formal Variations


16 Dec 2023, ACL ARR 2023 December Blind Submission, Readers: Everyone
Abstract: Natural language processing models have shown promising results on Persian text. However, their performance drops significantly on colloquial Persian sentences, leading to inaccurate predictions. This challenge arises from the substantial differences between colloquial and formal Persian and from the lack of parallel data for colloquial-to-formal Persian translation. To address this gap, we develop C2FDataset, a large-scale colloquial-to-formal Persian parallel dataset. The dataset is a resource for training models that bridge the linguistic variations between colloquial and formal Persian. To illustrate its utility, we trained a GPT-2 model on it. The model outperformed both ChatGPT and a leading rule-based system in colloquial-to-formal text style transfer. This conclusion is supported by an extensive human evaluation, conducted using a ranking-based scoring system we designed for this purpose. The results underscore the value of C2FDataset for improving the performance of natural language processing models on colloquial-to-formal Persian conversion.
Paper Type: short
Research Area: Multilinguality and Language Diversity
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Persian
