TAPO: Translation Augmented Policy Optimization for Multilingual Reasoning

ACL ARR 2026 January Submission9184 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0

Keywords: multilingualism, mathematical reasoning, language alignment

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
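The abstract gives no formulas, so the following is only a minimal sketch of one plausible reading of a "step-level relative advantage" built on GRPO. It assumes each sampled response has a translation (understanding) step and a reasoning step, each with its own scalar reward, and that each reward is normalized within the sampled group separately so that translation quality does not interfere with the answer-correctness signal. The function names, the reward values, and the per-step normalization are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-8):
    """GRPO-style normalization within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_level_advantages(translation_rewards, answer_rewards):
    """Hypothetical decoupling: normalize each step's reward separately.

    Returns one (understanding_advantage, reasoning_advantage) pair per response.
    In this sketch, the first value would weight the translation tokens and the
    second would weight the reasoning tokens of the same response.
    """
    adv_understand = group_relative(translation_rewards)
    adv_reason = group_relative(answer_rewards)
    return list(zip(adv_understand, adv_reason))


# Four sampled responses to one non-English prompt (made-up numbers).
translation_rewards = [0.9, 0.7, 0.8, 0.4]  # assumed translation-quality scores
answer_rewards = [1.0, 0.0, 1.0, 0.0]       # assumed correctness rewards
advantages = step_level_advantages(translation_rewards, answer_rewards)
```

Under these assumptions, a response with a good translation but a wrong answer gets a positive advantage on its translation tokens and a negative one on its reasoning tokens, instead of a single blended score for the whole sequence.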

Paper Type: Long

Research Area: Multilinguality and Language Diversity

Research Area Keywords: multilingualism, cross-lingual transfer, mathematical reasoning

Contribution Types: NLP engineering experiment, Approaches to low-resource settings

Languages Studied: English, Bengali, German, French, Spanish, Japanese, Russian, Swahili, Telugu, Thai, Chinese

Submission Number: 9184
