Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

ICLR 2026 Conference Submission 15373 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: multilingual reasoning, cultural reasoning, large language models, large reasoning models
TL;DR: LRMs default to reasoning in English/Chinese regardless of input language. Forcing non-hub reasoning hurts performance; aligning with hub languages helps. This asymmetry varies by task type and model size.
Abstract: Large reasoning models (LRMs), distinguished by their explicit generation of reasoning traces, have demonstrated impressive performance across reasoning tasks, yet their internal multilingual processes remain underexplored. We investigate a critical question: In which language do these models reason when solving problems presented in different languages? Our findings reveal that LRMs predominantly default to reasoning in high-resource "hub" languages such as English, regardless of the input language. Using a token prefilling method to steer their internal monologue, we find that constraining models to reason in the input's native language degrades accuracy on reasoning tasks (MMMLU, MATH-500) but can improve performance on cultural and safety benchmarks (CulturalBench, LMSYS-toxic). This phenomenon creates a fundamental trade-off between reasoning accuracy and behavioral alignment that is partially mitigated, but still persists, in larger-scale models. By systematically analyzing these linguistic biases, our work highlights a critical challenge in developing more equitable and transparent models, particularly as reasoning traces become increasingly user-facing for global audiences.
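The token prefilling mentioned in the abstract can be illustrated with a minimal sketch: pre-seed the assistant turn with the opening of a reasoning trace in a chosen language so the model continues its chain of thought in that language. The sketch below assumes an R1-style open-weight LRM served via Hugging Face transformers; the model name, the "<think>" tag, and the prefill strings are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of reasoning-language prefilling (assumption: an R1-style LRM
# whose reasoning trace opens with a "<think>" tag; model name and prefill
# strings are illustrative, not the paper's exact configuration).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example model choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

question = "..."  # a benchmark question in the input language

# Prefill the assistant turn so the reasoning trace starts in a chosen language;
# the model then tends to continue reasoning in that language.
prefill_by_language = {
    "en": "<think>\nLet me think through this step by step.",
    "ja": "<think>\n段階的に考えてみます。",  # "Let me think step by step."
}

# Note: some chat templates already emit the opening "<think>" tag after the
# generation prompt; adjust the prefill strings accordingly.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
) + prefill_by_language["ja"]

# add_special_tokens=False avoids inserting a second BOS token.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
output = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated continuation (reasoning trace + answer).
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```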
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15373