Abstract: Prior work has successfully applied Reinforcement Learning (RL) to mathematical reasoning—where rules and correctness are well-defined. Yet, generalizing these methods to broader reasoning domains remains challenging due to limited data and the lack of verifiable rewards for unstructured domains. In this work, we propose CrossThink, a framework that systematically incorporates multi-domain corpora into RL training to improve generalization across diverse reasoning tasks. CrossThink addresses key challenges by (1) combining data from varied sources; (2) applying structured templates to control answer-space complexity; (3) filtering for verifiable answers; and (4) optimizing data blending strategies to utilize multi-source data effectively. This enables scalable and verifiable reward modeling beyond math and demonstrates improved accuracies on both math (MATH-500: +30.1\%, AMC23: +27.5\%) and non-math reasoning benchmarks (MMLU-Pro: +12.8\%, GPQA-Diamond: +11.3\%, AGIEval: +15.1\%, SuperGPQA: +3.8\%). Moreover, CrossThink exhibits significantly improved response efficiency—using 28\% fewer tokens for correct answers—highlighting more focused and effective reasoning. Through CrossThink, we demonstrate that integrating multi-domain, multi-format data in RL leads to more accurate, efficient, and generalizable LLMs.
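To make the data-curation steps above concrete, the following is a minimal illustrative sketch of points (2)-(4): templating questions to bound the answer space, filtering for automatically verifiable answers, and blending domains by weight. It is an assumption-laden mock-up for exposition, not the CrossThink implementation; the `Sample` record, the verifiability heuristic, and the blending weights are all hypothetical.

```python
from dataclasses import dataclass
import random

# Hypothetical record type; field names are illustrative, not from the paper.
@dataclass
class Sample:
    domain: str     # e.g. "math", "law", "science"
    question: str
    answer: str     # reference answer used for rule-based reward checking

MCQ_TEMPLATE = (
    "Answer the following question. Choose exactly one option.\n"
    "{question}\nOptions:\n{options}\n"
)

def apply_template(sample: Sample, options: list[str]) -> str:
    """Recast an open-ended question as multiple choice to control answer-space complexity."""
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return MCQ_TEMPLATE.format(question=sample.question, options=opts)

def is_verifiable(sample: Sample) -> bool:
    """Toy filter: keep samples whose answers can be checked by simple string matching
    (short, unambiguous strings rather than free-form essays)."""
    return len(sample.answer.split()) <= 5

def blend(domain_pools: dict[str, list[Sample]], weights: dict[str, float],
          total: int, seed: int = 0) -> list[Sample]:
    """Draw a training mixture according to per-domain blending weights."""
    rng = random.Random(seed)
    mix: list[Sample] = []
    for domain, pool in domain_pools.items():
        pool = [s for s in pool if is_verifiable(s)]
        k = int(total * weights.get(domain, 0.0))
        mix.extend(rng.sample(pool, min(k, len(pool))))
    rng.shuffle(mix)
    return mix
```

In a sketch like this, the per-domain weights passed to `blend` are the tunable part of the recipe that the blending-strategy search in point (4) would optimize.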
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: reinforcement learning, reasoning, chain-of-thought, LLM
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 4653