Keywords: Large Language Models, Continue-Pretraining, Graph Problem Reasoning, General Reasoning
TL;DR: We introduce GraphPile, a 13B-token dataset for graph problem reasoning (GPR), to enhance general reasoning in LLMs. Models trained on GraphPile achieve significant gains across diverse reasoning tasks, extending LLM capabilities beyond mathematics.
Abstract: Large Language Models (LLMs) have made remarkable strides in reasoning tasks, yet their performance often falters on novel and complex problems. Domain-specific continue-pretraining (CPT) methods, such as those tailored for mathematical reasoning, have shown promise but lack transferability to broader reasoning tasks. In this work, we pioneer the use of Graph Problem Reasoning (GPR) to enhance LLMs' general reasoning capabilities. GPR tasks, which span pathfinding, network analysis, numerical computation, and topological reasoning, require sophisticated logical and relational reasoning, making them ideal for teaching diverse reasoning patterns. To this end, we introduce GraphPile, the first large-scale corpus specifically designed for CPT using GPR data. Spanning 10.9 billion tokens across 23 graph tasks, the dataset includes Chain-of-Thought, Program-of-Thought, Trace of Execution, and Real-world Graph Data. Using GraphPile, we train GraphMind on popular base models (Llama 3, Llama 3.1, and Gemma 2), achieving up to 4.9% higher accuracy in mathematical reasoning and up to 21.2% improvement in non-mathematical reasoning tasks, such as logical and commonsense reasoning. By being the first to harness GPR for enhancing reasoning patterns and introducing the first dataset of its kind, our work bridges the gap between domain-specific pretraining and universal reasoning capabilities, advancing the adaptability and robustness of LLMs.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 832