CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation

ACL ARR 2025 July Submission1137 Authors

29 Jul 2025 (modified: 01 Sept 2025) · CC BY 4.0
Abstract: Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high-quality translation. However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose **CycleDistill**, a bootstrapping approach that leverages LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill iteratively generates synthetic parallel corpora from monolingual corpora via zero- or few-shot MT, then fine-tunes the same model on the data it generated. CycleDistill requires no parallel corpora beyond 1 to 4 few-shot examples, and in our experiments on three Indian languages, relying solely on monolingual corpora, it achieves high-quality machine translation, improving upon a few-shot baseline model by **20-30 chrF points** on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
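The abstract describes an iterative generate-then-fine-tune loop. Below is a minimal Python sketch of that loop as described; the helper names (`translate`, `fine_tune`) and the structure are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the CycleDistill loop described in the abstract.
# `translate` and `fine_tune` are placeholder callables supplied by the
# user; they are NOT functions from the paper's codebase.

from typing import Callable, List, Tuple


def cycle_distill(
    model,                                       # any MT-capable LLM wrapper
    monolingual: List[str],                      # source-side monolingual corpus
    few_shot_examples: List[Tuple[str, str]],    # 1-4 seed translation pairs
    translate: Callable,                         # (model, examples, text) -> str
    fine_tune: Callable,                         # (model, parallel_pairs) -> model
    iterations: int = 3,
):
    """Bootstrap an MT model from monolingual data only.

    Each iteration: (1) the current model translates the monolingual
    corpus via few-shot prompting, yielding a synthetic parallel corpus;
    (2) that same model is fine-tuned (distilled) on the corpus it just
    generated. A soft-distillation variant would additionally store the
    teacher's softmax activations and train against them.
    """
    for _ in range(iterations):
        # Step 1: generate a synthetic parallel corpus with few-shot MT.
        synthetic = [
            (src, translate(model, few_shot_examples, src))
            for src in monolingual
        ]
        # Step 2: fine-tune the generator on its own synthetic data.
        model = fine_tune(model, synthetic)
    return model
```

The key design point, per the abstract, is that the teacher and student are the same model: each iteration's fine-tuned model becomes the generator for the next round of synthetic data.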
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine Translation, Low-Resource Languages, Knowledge Distillation, Large Language Models, Synthetic Parallel Corpora, Indic Languages, Self-Learning
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: Hindi, Bengali, Malayalam, Nepali, Manipuri
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4.1
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 4
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4
C3 Descriptive Statistics: Yes
C3 Elaboration: 5
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: Yes
E1 Information About Use Of AI Assistants: No
E1 Elaboration: We only use AI assistants for grammatical checking of our content. This has no bearing on our research, and hence we do not include information about their use.
Author Submission Checklist: Yes
Submission Number: 1137