Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Published: 14 Jul 2025, Last Modified: 14 Jul 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction-tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields, and ultimately distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added simply by incorporating a new node into our taxonomy. While promising, our approach may inherit biases or inaccuracies from LLM-generated data, as in other synthetic-data work, and is primarily evaluated on exam-style benchmarks. Broader evaluations and data quality control are left for future work.
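To make the pipeline described in the abstract concrete, the following is a minimal sketch of the taxonomy-to-instructions flow in Python. It is an illustration under stated assumptions, not the paper's implementation: the `llm` callable, the prompt templates, the helper names (`list_subjects`, `build_syllabus`, `generate_instructions`), and the sampling parameters `n` and `k` are all hypothetical.

```python
import random
from typing import Callable, Dict, List

# Hypothetical LLM completion interface: prompt in, text out.
# Any chat/completions client can be wrapped to fit this signature.
LLM = Callable[[str], str]

def list_subjects(llm: LLM, discipline: str) -> List[str]:
    """Enumerate subjects within a discipline (one taxonomy leaf)."""
    prompt = f"List the core subjects a student of {discipline} must study, one per line."
    return [s.strip() for s in llm(prompt).splitlines() if s.strip()]

def build_syllabus(llm: LLM, discipline: str, subject: str) -> List[List[str]]:
    """Draft a syllabus and return the key concepts of each class session."""
    prompt = (
        f"Design a syllabus for '{subject}' in {discipline}. For each class "
        "session, output one line: 'Session: <concept>; <concept>; ...'."
    )
    sessions = []
    for line in llm(prompt).splitlines():
        if line.startswith("Session:"):
            concepts = [c.strip() for c in line[len("Session:"):].split(";")]
            sessions.append([c for c in concepts if c])
    return sessions

def generate_instructions(llm: LLM, concepts: List[str],
                          n: int = 2, k: int = 2) -> List[Dict[str, str]]:
    """Sample concept subsets from one session and turn them into
    instruction/response pairs (sampling drives diversity)."""
    pairs = []
    for _ in range(n):
        picked = random.sample(concepts, min(k, len(concepts)))
        question = llm(
            f"Write a challenging homework question combining: {', '.join(picked)}."
        )
        answer = llm(f"Answer the following question step by step:\n{question}")
        pairs.append({"instruction": question, "response": answer})
    return pairs

def glan_pipeline(llm: LLM, disciplines: List[str]) -> List[Dict[str, str]]:
    """Disciplines -> subjects -> syllabi -> sessions -> instruction data."""
    data = []
    for discipline in disciplines:
        for subject in list_subjects(llm, discipline):
            for session_concepts in build_syllabus(llm, discipline, subject):
                data.extend(generate_instructions(llm, session_concepts))
    return data
```

Usage would look like `glan_pipeline(my_llm, ["Mathematics", "Computer Science"])`. The key design point the sketch tries to capture is that diversity comes from the structure itself: enumerating disciplines and subjects fixes coverage, while sampling small subsets of session-level key concepts varies the instructions generated within each topic.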
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We added a limitations section to clarify key challenges of our approach. These include potential biases and inaccuracies in LLM-generated data, limited benchmark coverage, and high costs associated with proprietary APIs. We now report results on IFEval and Evol-Instruct, and suggest future evaluation on AlpacaEval and MT-Bench. We also highlight the lack of hallucination detection in our current pipeline and propose exploring open-source models and automated filtering in future work. These limitations are now also mentioned in the introduction and abstract to more explicitly frame the scope and contributions of the paper.
Supplementary Material: zip
Assigned Action Editor: ~Andrew_Kyle_Lampinen1
Submission Number: 4583