Dual-Teacher Agreement for High-Precision Synthetic Data in Low-Resource MT

Published: 14 Jun 2026, Last Modified: 16 Jun 2026 | ICML 2026 Workshop MusIML Poster | Everyone | CC BY 4.0
Keywords: Low-resource machine translation, synthetic bitext, dual-teacher agreement, multilingual teacher models, cross-lingual semantic alignment, neural machine translation
TL;DR: We improve low-resource MT by generating synthetic bitext from two independent teacher models and keeping only translation pairs that agree and score high on semantic alignment and fluency.
Abstract: Low-resource machine translation (MT) is limited by scarce parallel data. Synthetic bitext generated from monolingual corpora can help, but it is often noisy and can be harmful in low-resource regimes. We propose dual-teacher agreement for high-precision synthetic data construction: two independent multilingual MT teachers translate each source sentence, and an agreement-based filter retains reliable pairs using surface consistency, cross-lingual semantic alignment, and target-side fluency. Experiments show that unfiltered synthetic augmentation is unstable and that single-teacher filtering yields smaller gains than dual-teacher agreement, which consistently improves chrF++ and BLEU and increases robustness under distribution shift. Quality and error analyses confirm that agreement filtering produces cleaner synthetic corpora with fewer entity errors, reduced meaning drift, and improved adequacy.
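The filtering pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring functions or thresholds, so everything named here is an assumption. Surface consistency is approximated by a symmetric character n-gram F1 between the two teacher outputs; `teacher_a`, `teacher_b`, `semantic_score`, and `fluency_score` are hypothetical callables supplied by the user; the three `min_*` thresholds are placeholder values; and keeping the hypothesis with the higher semantic score is one possible choice among several.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after collapsing runs of whitespace."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def surface_agreement(hyp_a: str, hyp_b: str, max_n: int = 3) -> float:
    """Symmetric character n-gram F1 between two hypotheses,
    averaged over n = 1..max_n (a simplified stand-in for a
    chrF-style consistency score; an assumption, not the paper's metric)."""
    scores = []
    for n in range(1, max_n + 1):
        a, b = char_ngrams(hyp_a, n), char_ngrams(hyp_b, n)
        if not a or not b:
            continue  # hypothesis too short for this n-gram order
        overlap = sum((a & b).values())
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(b.values())
        recall = overlap / sum(a.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def filter_pairs(
    sources: Iterable[str],
    teacher_a: Callable[[str], str],              # hypothetical MT teacher 1
    teacher_b: Callable[[str], str],              # hypothetical MT teacher 2
    semantic_score: Callable[[str, str], float],  # hypothetical (source, target) scorer
    fluency_score: Callable[[str], float],        # hypothetical target-side scorer
    min_agreement: float = 0.6,                   # placeholder threshold
    min_semantic: float = 0.8,                    # placeholder threshold
    min_fluency: float = 0.5,                     # placeholder threshold
) -> List[Tuple[str, str]]:
    """Keep (source, translation) pairs on which both teachers agree and
    whose chosen translation passes the semantic and fluency checks."""
    kept = []
    for src in sources:
        hyp_a, hyp_b = teacher_a(src), teacher_b(src)
        # 1) Surface consistency between the two teachers.
        if surface_agreement(hyp_a, hyp_b) < min_agreement:
            continue
        # 2) Choose the hypothesis better aligned with the source (assumed rule).
        best = max((hyp_a, hyp_b), key=lambda h: semantic_score(src, h))
        if semantic_score(src, best) < min_semantic:
            continue
        # 3) Target-side fluency check.
        if fluency_score(best) < min_fluency:
            continue
        kept.append((src, best))
    return kept
```

In practice the two teacher callables would wrap real MT systems and the scorers would wrap a cross-lingual sentence encoder and a target-language model; the sketch only fixes the order of the checks, so that a pair is discarded as soon as any one of the three criteria fails.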
Track: Track 2: ML Research by Muslim Authors
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 57