Keywords: Multilingual, Reasoning, Long Chain-of-Thought
TL;DR: We analyze scaling trends, pretraining, and inference to understand how long CoT reasoning transfers to non-English languages, demonstrating that post-training techniques overcome key limitations.
Abstract: While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world’s languages. In this work, we systematically investigate four key stages of model development (scaling, pretraining, post-training, and inference) to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs, such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23576