Keywords: Code-mixing, translation, human evaluations, NLLB, XNLI, IN22, largest dataset
Abstract: Code-mixing, the practice of blending multiple languages within a single utterance, is a widespread linguistic phenomenon in multilingual societies such as India. While substantial progress has been made in machine translation for Hinglish (Hindi–English), other low-resource code-mixed variants such as Gujlish (Gujarati–English) remain largely unexplored. Developing effective translation systems for such languages is challenging due to the scarcity of high-quality parallel corpora. To bridge this gap, we present the first large-scale, general-purpose Gujlish–English parallel corpus, comprising approximately 30k sentence pairs. The dataset was curated from the BPCC corpus (AI4Bharat) and translated using GPT-4o, followed by human validation. We fine-tune the multilingual NLLB-200 model on this corpus to establish the first baselines for Gujlish→English translation. Evaluated on the XNLI and IN22 benchmarks, our model significantly outperforms Google Translate, achieving 1.5–2× improvements in BLEU and ChrF++ scores and shifting COMET scores from near-zero to strongly positive. Both the dataset and the fine-tuned model are publicly released to support future research on low-resource code-mixed languages.
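The abstract describes fine-tuning NLLB-200 on the Gujlish–English pairs. The paper's actual training code is not shown here; the following is a minimal sketch of how such a fine-tune could look with Hugging Face `transformers`. The checkpoint (`facebook/nllb-200-distilled-600M`), the file name `gujlish_english.tsv`, the hyperparameters, and the use of Gujarati's NLLB code (`guj_Gujr`) as a stand-in source-language tag for Gujlish are all illustrative assumptions, not the authors' setup.

```python
# Sketch: fine-tuning an NLLB-200 checkpoint on a Gujlish->English TSV corpus.
# Assumes transformers>=4.28 and datasets are installed.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "facebook/nllb-200-distilled-600M"  # assumed; paper says "NLLB-200"
# NLLB has no code-mixed language tags, so guj_Gujr is an assumed stand-in
# for the Gujlish source side.
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, src_lang="guj_Gujr", tgt_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Hypothetical corpus file: one tab-separated Gujlish/English pair per line.
raw = load_dataset(
    "csv",
    data_files={"train": "gujlish_english.tsv"},
    delimiter="\t",
    column_names=["gujlish", "english"],
)

def preprocess(batch):
    # Tokenize source and target together; text_target handles the labels.
    return tokenizer(
        batch["gujlish"],
        text_target=batch["english"],
        max_length=128,
        truncation=True,
    )

tokenized = raw["train"].map(
    preprocess, batched=True, remove_columns=["gujlish", "english"]
)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-gujlish-en",
    per_device_train_batch_size=16,  # illustrative hyperparameters
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The reported BLEU/ChrF++/COMET comparisons would then come from decoding the XNLI and IN22 test sets with the fine-tuned model and scoring against references (e.g., with sacrebleu and a COMET checkpoint); those evaluation details are likewise not specified in the abstract.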
Submission Number: 19