Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
Keywords: reasoning, llm, synthetic data, chain-of-thought, pretraining, test-time computing, evaluation, analysis
TL;DR: We evaluate how reconstructing implicit reasoning processes behind expert texts during continual pretraining enhances LLM reasoning capabilities across STEM and legal domains.
Abstract: Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, these approaches to training reasoning models are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning, and how such data affect a wide range of domains, remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the product of the author's thinking process. Our analysis shows that Reasoning CPT can significantly enhance reasoning ability even when trained on non-STEM corpora that have rarely been used for reasoning tasks. On both MMLU and GPQA, Reasoning CPT achieved substantial improvements over the base model and standard CPT. For instance, on GPQA Diamond, performance improved from 23.7% with the base model to 32.8% with Reasoning CPT, while on MMLU the benefits became more pronounced as problem difficulty increased, with gains of up to 11.2 points on the hardest questions. Most notably, models trained on hidden thoughts mined from legal texts outperformed models trained with standard CPT on STEM data, strongly suggesting that reasoning abilities can be enhanced not only from STEM corpora but also from diverse domains, opening a new direction beyond the conventional STEM-centric paradigm of reasoning model training.
Primary Area: generative models
Submission Number: 23549