Aligning Multilingual Reasoning With Verifiable Semantics From A High-Resource Expert Model

Fahim Faisal; Kaiqiang Song; Song Wang; Simin Ma; Shujian Liu; Haoyun Deng; Sathish Reddy Indurthi

18 Sept 2025 (modified: 09 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Multilingual, LLMs, Reinforcement Learning, Expert Rewards

TL;DR: We developed a method to improve the reasoning skills of AI models in various languages by using their strong English abilities as a guide.

Abstract: While reinforcement learning has advanced the reasoning abilities of Large Language Models (LLMs), these gains are largely confined to English, creating a significant performance disparity across languages. To address this, we introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards (PB-RLSVR), a novel framework that enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. Our approach employs a high-performing English LLM as a "pivot" model to generate reference responses for reasoning tasks. A multilingual model is then rewarded based on the semantic equivalence of its responses to the English reference, effectively transferring the pivot model's reasoning capabilities across languages. We investigate several cross-lingual semantic reward functions, including those based on embeddings and machine translation. Extensive experiments on a suite of multilingual reasoning benchmarks show that our method significantly narrows the performance gap between English and other languages, substantially outperforming traditional PPO baselines. Specifically, our PB-RLSVR framework improves the average multilingual performance of Llama-3.1-8B-Instruct and Qwen3-32B by 16.41% and 10.17%, respectively, demonstrating a powerful and data-efficient approach to building truly multilingual reasoning agents.
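
To make the embedding-based reward idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reward is the cosine similarity between an embedding of the target-language response and an embedding of the English pivot reference, clamped to [0, 1]; the abstract does not specify the exact formula. The names `semantic_reward`, `toy_embed`, and `TOY_EMBEDDINGS` are hypothetical, and the toy lookup table stands in for a real multilingual sentence encoder.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_reward(
    response: str,
    pivot_reference: str,
    embed: Callable[[str], Vector],
) -> float:
    """Assumed reward: similarity to the English pivot reference, clamped to [0, 1]."""
    similarity = cosine_similarity(embed(response), embed(pivot_reference))
    return max(0.0, min(1.0, similarity))


# Hypothetical stand-in for a multilingual sentence encoder: a fixed lookup
# table of 2-d vectors, chosen by hand purely for illustration.
TOY_EMBEDDINGS = {
    "The answer is 12.": (1.0, 0.0),    # English pivot reference
    "La respuesta es 12.": (0.9, 0.1),  # semantically equivalent response
    "La respuesta es 7.": (0.0, 1.0),   # semantically different response
}


def toy_embed(text: str) -> Vector:
    return TOY_EMBEDDINGS[text]


reference = "The answer is 12."
reward_match = semantic_reward("La respuesta es 12.", reference, toy_embed)
reward_mismatch = semantic_reward("La respuesta es 7.", reference, toy_embed)
```

In this sketch the equivalent Spanish response receives a higher reward than the mismatched one. In a reinforcement learning loop, a scalar of this kind would take the place of a reward computed from human-annotated target-language data.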

Primary Area: foundation or frontier models, including LLMs

Submission Number: 13620
