CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation

ACL ARR 2025 July Submission1137 Authors

29 Jul 2025 (modified: 01 Sept 2025) · CC BY 4.0
Abstract: Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high-quality translation. However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose **CycleDistill**, a bootstrapping approach that leverages LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill iteratively generates synthetic parallel corpora from monolingual corpora via zero- or few-shot MT, then fine-tunes the same model on the data it generated. CycleDistill requires no parallel corpora beyond 1 to 4 few-shot examples, and in our experiments on three Indian languages, relying solely on monolingual corpora, it achieves high-quality machine translation, improving upon a few-shot baseline model by **20-30 chrF points** on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
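The abstract describes an iterative generate-then-fine-tune loop. Below is a minimal Python sketch of that loop as described; the helper names (`translate`, `fine_tune`) and the structure are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the CycleDistill loop described in the abstract.
# `translate` and `fine_tune` are placeholder callables supplied by the
# user; they are NOT functions from the paper's codebase.

from typing import Callable, List, Tuple


def cycle_distill(
    model,                                       # any MT-capable LLM wrapper
    monolingual: List[str],                      # source-side monolingual corpus
    few_shot_examples: List[Tuple[str, str]],    # 1-4 seed translation pairs
    translate: Callable,                         # (model, examples, text) -> str
    fine_tune: Callable,                         # (model, parallel_pairs) -> model
    iterations: int = 3,
):
    """Bootstrap an MT model from monolingual data only.

    Each iteration: (1) the current model translates the monolingual
    corpus via few-shot prompting, yielding a synthetic parallel corpus;
    (2) that same model is fine-tuned (distilled) on the corpus it just
    generated. A soft-distillation variant would additionally store the
    teacher's softmax activations and train against them.
    """
    for _ in range(iterations):
        # Step 1: generate a synthetic parallel corpus with few-shot MT.
        synthetic = [
            (src, translate(model, few_shot_examples, src))
            for src in monolingual
        ]
        # Step 2: fine-tune the generator on its own synthetic data.
        model = fine_tune(model, synthetic)
    return model
```

The key design point, per the abstract, is that the teacher and student are the same model: each iteration's fine-tuned model becomes the generator for the next round of synthetic data.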
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Machine Translation, Low-Resource Languages, Knowledge Distillation, Large Language Models, Synthetic Parallel Corpora, Indic Languages, Self-Learning
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: Hindi, Bengali, Malayalam, Nepali, Manipuri
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4.1
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 4
B6 Statistics For Data: Yes
B6 Elaboration: 4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4
C3 Descriptive Statistics: Yes
C3 Elaboration: 5
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E AI Assistants In Research Or Writing: Yes
E1 Information About Use Of AI Assistants: No
E1 Elaboration: We only use AI assistants for grammatical checking of our content. This has no bearing on our research, and hence we do not include information about their use.
Author Submission Checklist: Yes
Submission Number: 1137