Not All Documents Are What You Need for Extracting Instruction Tuning Data

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: data extraction, data efficient, instruction tuning
Abstract: Instruction tuning improves LLM performance but depends on high-quality training data. Recently, LLMs have been used to synthesize data, enhancing training with seeds such as question-answer (QA) pairs. However, this synthesis often yields instruction examples similar to the seeds, lacking diversity and biasing real applications. We therefore propose to extract instruction tuning data from web corpora, which contain much richer knowledge. The most straightforward strategy is to quickly retrieve domain-specific documents from the corpus and then extract all QA pairs from these documents for tuning LLMs, but this has two main limitations: (1) extracting all QA pairs using LLMs is prohibitively expensive; and (2) the extracted pairs are not all beneficial for downstream applications, and incorporating all of them for tuning may even hurt model performance. To overcome these limitations, we introduce $\texttt{EQUAL}$, an $\textbf{E}$ffective and scalable data extraction framework that iteratively interleaves document selection with the extraction of high-$\textbf{QUAL}$ity QA pairs to optimize instruction tuning. $\texttt{EQUAL}$ first clusters the document set based on embeddings generated by contrastive learning. It then leverages a multi-armed-bandit strategy to quickly identify document clusters from which high-quality QA pairs can be extracted for training. This iterative framework significantly reduces computational cost while substantially improving model performance. Experiments on AutoMathText, KnowledgePile and StackOverflow across 13 downstream tasks demonstrate that $\texttt{EQUAL}$ reduces computational costs by 5–10$\times$ while improving accuracy by 2.5\% on LLaMA-3.1-8B, Qwen2.5-7B and Mistral-7B. Code and data are available at https://anonymous.4open.science/r/EQUAL-DD20.
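The abstract describes selecting document clusters via a multi-armed bandit, where the reward is the quality of QA pairs extracted from a sampled cluster. The following is a minimal, hypothetical sketch of that loop using the standard UCB1 strategy; the cluster quality values, noise model, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import math
import random

def ucb1_select(counts, rewards, t, c=2.0):
    """Pick the cluster with the highest UCB1 score; sample unseen clusters first."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    scores = [rewards[k] / counts[k] + math.sqrt(c * math.log(t) / counts[k])
              for k in range(len(counts))]
    return max(range(len(counts)), key=lambda k: scores[k])

def run_bandit(cluster_quality, rounds=500, seed=0):
    """Iteratively pick a cluster, 'extract' QA pairs from it, and observe a
    quality reward; returns how often each cluster was sampled."""
    rng = random.Random(seed)
    K = len(cluster_quality)
    counts = [0] * K       # pulls per cluster
    rewards = [0.0] * K    # cumulative observed QA quality per cluster
    for t in range(1, rounds + 1):
        k = ucb1_select(counts, rewards, t)
        # Hypothetical reward: a noisy observation of the cluster's true QA
        # quality, standing in for an LLM-based quality score in [0, 1].
        r = min(1.0, max(0.0, rng.gauss(cluster_quality[k], 0.1)))
        counts[k] += 1
        rewards[k] += r
    return counts

if __name__ == "__main__":
    # Three hypothetical clusters; cluster 1 yields the best QA pairs, so the
    # bandit should concentrate its extraction budget there.
    counts = run_bandit([0.3, 0.8, 0.5])
    print(counts)
```

In this sketch the exploration bonus shrinks as a cluster is sampled more, so the loop quickly concentrates the (expensive) QA-extraction budget on the clusters that keep returning high-quality pairs, which is the cost-saving mechanism the abstract attributes to the bandit step.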
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14969