Adapting Multilingual Models for Code-Mixed Translation using Back-to-Back Translation

Anonymous

17 Aug 2021 (modified: 05 May 2023) · ACL ARR 2021 August Blind Submission
Abstract: In this paper, we explore the problem of translating code-mixed sentences into an equivalent monolingual form. The scarcity of gold-standard code-mixed to monolingual parallel data makes it difficult to train a translation model that performs this task reliably. Prior work has addressed this paucity of parallel data with data augmentation techniques, but such techniques rely heavily on external resources, making the resulting systems difficult to train and to scale to multiple languages. We present a simple yet highly effective training scheme for adapting multilingual models to code-mixed translation. Our method eliminates the dependence on external resources by creating synthetic data via a novel two-stage back-translation approach. We show substantial improvements in translation quality (measured by BLEU), beating prior work by up to +3.8 BLEU on code-mixed Hi$\rightarrow$En, Mr$\rightarrow$En, and Bn$\rightarrow$En tasks. On the LinCE Machine Translation leaderboard, we achieve the highest score for code-mixed Es$\rightarrow$En, beating the best existing baseline by +6.5 BLEU and our own stronger baseline by +1.1 BLEU.
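The abstract only names the two-stage back-translation procedure without describing it. The sketch below is a hedged illustration of the general idea of generating synthetic parallel data with an off-the-shelf multilingual model, not the paper's exact method: it uses mBART-50 from HuggingFace Transformers to translate monolingual English into Hindi and back, pairing the intermediate output with the original sentence as pseudo-parallel training data. The model name, language codes, and pairing scheme are assumptions chosen for illustration.

```python
# Hypothetical sketch: synthetic parallel data via two chained translation
# passes with a multilingual model (mBART-50). This only illustrates the
# generic back-translation idea; the paper's two-stage procedure may differ.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL_NAME)
model = MBartForConditionalGeneration.from_pretrained(MODEL_NAME)

def translate(sentences, src_lang, tgt_lang):
    """Translate a batch of sentences between mBART-50 language codes."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang]
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Stage 1: translate clean monolingual English into Hindi.
english = ["I will meet you at the station tomorrow."]
hindi = translate(english, "en_XX", "hi_IN")

# Stage 2: translate the stage-1 output back into English, so the round trip
# injects model noise; pair the noisy source side with the clean target side.
round_trip = translate(hindi, "hi_IN", "en_XX")
synthetic_pairs = list(zip(hindi, english))
print(synthetic_pairs)
print(round_trip)
```

Pairs produced this way could then be used to fine-tune the same multilingual model for code-mixed-to-monolingual translation, avoiding any dependence on external lexicons or word-alignment tools.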