Large Language Models as a Normalizer for Transliteration and Dialectal Translation

ACL ARR 2024 June Submission 761 Authors

13 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: NLP models trained on standardized language data often struggle with non-standard variation. We assess various Large Language Models (LLMs) as normalizers for transliteration and dialectal text. Tuning open-source LLMs with LoRA on as few as 10,000 parallel examples can achieve results comparable to, or better than, those of closed-source LLMs. We perform dialectal normalization experiments for twelve South Asian languages and dialectal translation experiments for six language continua worldwide. Dialectal normalization can also serve as a preliminary step for the downstream dialectal translation task. Among the six languages used in dialectal translation, our approach enables Italian and Swiss German to surpass the baseline model by 21.55 and 25.79 BLEU points, respectively.
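
The LoRA tuning setup described in the abstract can be illustrated with a minimal sketch, assuming a Hugging Face transformers/peft workflow; the base model name, prompt format, toy dataset, and hyperparameters below are illustrative assumptions, not the authors' actual configuration.

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed open-source base model; the abstract does not name a specific one.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Low-rank adapters: only a small set of parameters is trained on the
# roughly 10,000 parallel (dialectal -> standard) examples.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Toy stand-in for the parallel normalization data.
pairs = [{"src": "dialectal sentence", "tgt": "standardized sentence"}]

def to_features(ex):
    # Hypothetical prompt format; labels mirror the inputs for causal LM loss.
    text = f"Normalize: {ex['src']}\nStandard: {ex['tgt']}{tokenizer.eos_token}"
    enc = tokenizer(text, truncation=True, max_length=256, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()
    return enc

train = Dataset.from_list(pairs).map(to_features, remove_columns=["src", "tgt"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-normalizer", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=train,
)
trainer.train()

At inference time, the tuned adapter would be applied to the same base model and prompted with the "Normalize:" template to produce the standardized form, which can then feed a downstream translation system.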
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Efficient/Low-Resource Methods for NLP, Machine Translation
Contribution Types: Approaches to low-resource settings
Languages Studied: Arabic, Bengali, Basque, Italian, Kurdish, Swiss German, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Sindhi, Sinhala, Tamil, Telugu, and Urdu
Submission Number: 761