Fluent Alignment with Disfluent Judges: Post-training for lower-resource languages

ICLR 2026 Conference Submission 20980 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: language model, post-training, fluency, low-resource languages, RLAIF
TL;DR: We propose a preference-optimization method for lower-resource languages that results in fluent language models even when aligned by disfluent reward models.
Abstract: We propose a post-training method for lower-resource languages that preserves the fluency of language models even when they are aligned by disfluent reward models. Preference optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. In this work, we therefore focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial: our method outperforms both alternatives without relying on any hard-to-obtain data.
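To make the on-policy idea concrete, the sketch below shows a generic on-policy preference-optimization loop with an AI judge, not the authors' exact pipeline: the policy itself generates target-language completions, a reward model (which may itself be disfluent in that language) only ranks them, and the policy is updated with a DPO-style loss on the resulting preference pair. The names `policy`, `ref_policy`, `judge`, and the `judge(samples)` scoring interface are illustrative assumptions; the models are assumed to be Hugging Face-style causal LMs.

```python
# Minimal sketch of on-policy preference optimization with an AI judge.
# Assumptions (not from the paper): `policy` and `ref_policy` are Hugging Face
# causal LMs, `judge` is a callable returning one scalar score per sequence.
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, prompt_len):
    """Sum of token log-probabilities of the completion part of `input_ids`.
    For brevity, padding after EOS is not masked out."""
    with torch.set_grad_enabled(model.training):
        logits = model(input_ids).logits[:, :-1, :]
    labels = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Mask out prompt tokens so only the completion is scored.
    mask = torch.arange(labels.size(1), device=labels.device) >= (prompt_len - 1)
    return (logps * mask).sum(-1)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective over (chosen, rejected) completion log-probs."""
    return -F.logsigmoid(
        beta * ((pol_chosen - pol_rejected) - (ref_chosen - ref_rejected))
    ).mean()


def on_policy_step(policy, ref_policy, judge, tokenizer, prompt, optimizer,
                   beta=0.1, num_candidates=4, max_new_tokens=256):
    """One on-policy iteration for a single prompt:
    1. sample candidate completions from the *current* policy,
    2. let the (possibly disfluent) reward model pick best / worst,
    3. take a DPO-style gradient step on that preference pair."""
    enc = tokenizer(prompt, return_tensors="pt").to(policy.device)
    prompt_len = enc.input_ids.size(1)

    policy.eval()
    with torch.no_grad():
        samples = policy.generate(
            **enc, do_sample=True, top_p=0.9, temperature=1.0,
            num_return_sequences=num_candidates, max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        # The judge only ranks; the fluent text always comes from the policy.
        scores = judge(samples)
    chosen = samples[scores.argmax()].unsqueeze(0)
    rejected = samples[scores.argmin()].unsqueeze(0)

    policy.train()
    ref_policy.eval()
    loss = dpo_loss(
        sequence_logprob(policy, chosen, prompt_len),
        sequence_logprob(policy, rejected, prompt_len),
        sequence_logprob(ref_policy, chosen, prompt_len),
        sequence_logprob(ref_policy, rejected, prompt_len),
        beta=beta,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key property this sketch illustrates is that the judge never writes any text: even a disfluent reward model can only choose among completions the policy itself produced, so the training signal steers preferences without injecting disfluent supervision targets, in contrast to supervised finetuning on machine-translated data.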
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20980