Keywords: LLM Reasoning, Multilingual Thinking, GRPO
Abstract: As LLMs develop stronger multilingual capabilities, the long-standing English-centric bias is gradually diminishing. In some reasoning tasks, responses in non-English languages even surpass those in English. Existing approaches, such as majority voting or weighting across languages, have explored this potential but often fall short due to task complexity and suboptimal language selection. To investigate the role of language diversity in reasoning, we conduct a \textit{Polyglot Thinking Experiment}, prompting models to answer each question in ten different languages or without any language restriction. Results show that non-English responses often achieve higher accuracy than English ones, and the best performance frequently emerges when the model is free to choose its response language. These findings suggest that LLMs benefit from a broader and more flexible multilingual thinking space. Building on this insight, we propose \textbf{Multilingual Group Relative Policy Optimization (mGRPO)}, a reinforcement learning framework that improves LLM reasoning by generating multilingual preference data online from both language-constrained and unconstrained prompts. The model is optimized through group-wise reward comparisons based on answer accuracy and reasoning format. Despite relying on only ~18.1k training examples without chain-of-thought supervision, mGRPO achieves consistent gains across four benchmarks (MGSM, MATH500, PolyMath, and X-CSQA), outperforming the two base LLMs (Qwen2.5 and Llama3) by an average of 7.5\% and obtaining state-of-the-art scores. These results highlight the value of multilingual thinking and show that mGRPO is a lightweight yet effective approach to unlocking the reasoning potential of LLMs.
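To make the group-wise comparison in the abstract concrete, the following is a minimal sketch of a GRPO-style group-relative advantage over a group of sampled responses, combining an accuracy reward with a reasoning-format reward. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`reward`, `group_relative_advantages`), the `FORMAT_WEIGHT` value, and the accuracy/format checks are all hypothetical placeholders.

```python
# Sketch only: GRPO-style group-relative advantages with an assumed
# accuracy + format reward. All names and weights below are illustrative.
from statistics import mean, stdev

FORMAT_WEIGHT = 0.2  # assumed weight of the format component, not from the paper

def reward(response: str, gold_answer: str) -> float:
    is_correct = response.strip().endswith(gold_answer)            # placeholder accuracy check
    format_ok = "<think>" in response and "</think>" in response   # placeholder format check
    return float(is_correct) + FORMAT_WEIGHT * float(format_ok)

def group_relative_advantages(responses: list[str], gold_answer: str) -> list[float]:
    """Normalize each response's reward against its own sampled group (GRPO-style)."""
    rewards = [reward(r, gold_answer) for r in responses]
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Usage: in an mGRPO-like setup, one group could mix language-constrained
# and unconstrained generations for the same question.
group = [
    "<think>Deux plus deux font quatre.</think> 4",  # French-constrained sample
    "<think>2 + 2 = 4</think> 4",                    # unconstrained sample
    "The answer is 5",                               # incorrect, wrong format
]
print(group_relative_advantages(group, "4"))
```

The sketch only shows the normalization step; how mGRPO constructs the multilingual groups online and weights the two reward terms is described in the paper itself.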
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 16216