Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

ACL ARR 2024 June Submission 2414 Authors

15 Jun 2024 (modified: 04 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to scientific, technical, and educational domains. Translation models thus generally struggle with tasks that involve scientific understanding or technical jargon, and their performance is even worse for low-resource Indian languages. Finding a translation dataset that caters to these domains in particular poses a difficult challenge. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of high-quality English-to-Indic and Indic-to-Indic translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also fine-tune and evaluate NMT models on this corpus, surpassing all other publicly available models on in-domain tasks. We further improve on the baseline by over 2 BLEU on average for these Indian languages, demonstrating the potential to generalize to out-of-domain translation tasks as well. We are pleased to release the corresponding models and dataset, accessible at this link: https://huggingface.co/anon-auth.
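The abstract's core technique is bitext mining over human-translated lecture transcriptions. As a rough illustration of what that step can look like, here is a minimal sketch using embedding-based alignment with a multilingual encoder (LaBSE via the sentence-transformers library); the paper does not specify its actual mining pipeline, so the `mine_pairs` helper, the similarity threshold, and the greedy alignment strategy below are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of embedding-based bitext mining.
# Assumes a LaBSE-style multilingual encoder; the paper's actual
# pipeline is not described here, so names and the 0.8 threshold
# are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(src_sents, tgt_sents, threshold=0.8):
    """Greedily align source/target sentences by cosine similarity."""
    src_emb = model.encode(src_sents, convert_to_tensor=True,
                           normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True,
                           normalize_embeddings=True)
    sims = util.cos_sim(src_emb, tgt_emb)  # |src| x |tgt| similarity matrix
    pairs = []
    for i, src in enumerate(src_sents):
        j = int(sims[i].argmax())          # best target candidate for src
        score = float(sims[i][j])
        if score >= threshold:             # keep only confident matches
            pairs.append((src, tgt_sents[j], score))
    return pairs

# Example: align an English lecture sentence with a Hindi transcription.
english = ["The derivative measures the rate of change."]
hindi = ["अवकलज परिवर्तन की दर मापता है।"]
print(mine_pairs(english, hindi))
```

A greedy argmax with a similarity floor is the simplest mining policy; production pipelines often use margin-based scoring over both directions to reduce false matches, but the single-threshold version above is enough to convey the idea.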
Paper Type: Short
Research Area: Machine Translation
Research Area Keywords: domain adaptation, multilingual MT
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Hindi, Marathi, Bengali, Tamil, Telugu, Malayalam, Gujarati, Kannada
Submission Number: 2414