Keywords: Natural Language Processing, Large Language Model
TL;DR: We present a general language model and training framework (MtPO) that supports multiple low-resource languages
Abstract: Multilingual machine translation for low-resource languages struggles with inefficient tokenization and unstable exploration--exploitation dynamics when optimized via reinforcement learning.
We propose Multilingual Translation Policy Optimization (MtPO), a comprehensive three-stage framework: (1) two-stage continued pretraining that expands low-resource vocabularies, boosting compression ratios and inference efficiency; (2) curriculum-based supervised fine-tuning that ramps up task complexity across three phases while preserving general and specialized translation skills; and (3) reinforcement learning optimization that tackles length bias and diversity collapse affecting methods such as GRPO, enhanced with Reinforcement Learning with Verifiable Rewards (RLVR).
The RL component supplements semantic rewards with fast, deterministic constraints on length ratio, structural-token retention (HTML/Markdown), target-language validity, and code-mixing, hardening models against messy real-world prompts.
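As an illustration, the sketch below shows one way such verifiable checks could be composed into a binary constraint reward; the thresholds, the language-identification callable, and the copied-span heuristic for code-mixing are assumptions made for the example, not the paper's reported configuration.

# Hypothetical sketch of RLVR-style deterministic constraint checks.
# Thresholds, lang_id_fn, and the copy-span heuristic are illustrative assumptions.
import re

def verifiable_reward(source: str, hypothesis: str, target_lang: str,
                      lang_id_fn, max_len_ratio: float = 2.0) -> float:
    """Return 1.0 only if every hard constraint passes, else 0.0."""
    # Length-ratio check: reject runaway or truncated outputs.
    src_len = max(len(source.split()), 1)
    hyp_len = len(hypothesis.split())
    if not (1.0 / max_len_ratio <= hyp_len / src_len <= max_len_ratio):
        return 0.0

    # Structural-token retention: HTML/Markdown markers in the source must survive.
    markers = re.findall(r"</?[a-zA-Z][^>]*>|`{1,3}|\*{1,2}", source)
    if any(m not in hypothesis for m in markers):
        return 0.0

    # Target-language validity: a language-ID classifier must agree with the request.
    if lang_id_fn(hypothesis) != target_lang:
        return 0.0

    # Code-mixing check: reject outputs that copy long source spans verbatim.
    src_tokens = source.split()
    for i in range(len(src_tokens) - 4):
        if " ".join(src_tokens[i:i + 5]) in hypothesis:
            return 0.0

    return 1.0

In this reading, the final reward would combine the semantic score with this fast check, for example by zeroing out or down-weighting trajectories that violate any constraint.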
MtPO couples entropy-tempered advantages, temporal decay, asymmetric clipping, and token-wise reward normalization to sustain exploration early in training before converging, while RLVR enforces reliable outputs without harming translation quality.
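A minimal, hypothetical sketch of how these ingredients might combine in a clipped policy-gradient loss follows; the coefficient names and default values are assumptions, not the paper's reported hyperparameters.

# Hypothetical combination of token-wise reward normalization, an entropy-tempered
# advantage with temporal decay, and asymmetric clipping. Names and values are assumed.
import torch

def mtpo_policy_loss(logp_new, logp_old, rewards, entropy, step,
                     eps_low=0.2, eps_high=0.3, alpha=0.1, decay=1e-4):
    # All tensor arguments have shape [batch, seq_len]; step is the training step.

    # Token-wise reward normalization: standardize rewards per token position
    # across the batch so that long sequences do not dominate the gradient.
    adv = (rewards - rewards.mean(dim=0, keepdim=True)) / (rewards.std(dim=0, keepdim=True) + 1e-8)

    # Entropy-tempered advantage with temporal decay: reward exploration early,
    # and let the bonus fade as training progresses.
    temper = alpha * torch.exp(torch.tensor(-decay * float(step)))
    adv = adv + temper * entropy.detach()

    # Asymmetric clipping: allow larger upward probability-ratio moves (eps_high)
    # than downward ones (eps_low) to counter diversity collapse.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * adv, clipped * adv).mean()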
Experiments demonstrate notable gains in tokenization efficiency, translation quality, and exploration--exploitation balance, a substantive step forward for multilingual models serving underrepresented languages in practical deployments.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1480