Bread: A Hybrid Approach for Instruction Data Mining Through Balanced Retrieval and Dynamic Data Sampling

Published: 01 Jan 2024, Last Modified: 20 May 2025 · NLPCC (2) 2024 · CC BY-SA 4.0
Abstract: Recent advancements in Instruction Tuning (IT) have shown promise for aligning Large Language Models (LLMs) with users’ intentions, yet its efficacy is often compromised by dependence on high-quality datasets. Previous works have concentrated on aggregating or producing huge IT datasets through human labor or costly LLM APIs, yet lack adequate mechanisms to guarantee the quality of the resulting data. Moreover, training on such a volume of IT data is both time-consuming and expensive. To address these issues, we present Bread (Instruction Mining through Balanced REtrieval And Dynamic Data Sampling), a novel approach designed to minimize the requisite volume of IT data. Bread uses a two-stage strategy combining balanced retrieval and dynamic sampling to ensure data diversity and quality, offering a cost-saving solution that does not rely on any specific LLM. Experimental results suggest that Bread outperforms baselines and shows great flexibility across various IT datasets and LLMs, marking a step forward in efficient Instruction Tuning. Our code is available at https://github.com/mihara-bot/Bread.
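To make the two-stage strategy concrete, here is a minimal sketch of what "balanced retrieval followed by dynamic sampling" could look like. All names (`balanced_retrieval`, `dynamic_sampling`, `cluster_of`, `score`) and the toy clustering/scoring functions are illustrative assumptions, not the paper's actual implementation; in Bread itself, clustering would operate on instruction embeddings and the quality signal would be model-based.

```python
import random
from collections import defaultdict


def balanced_retrieval(instructions, cluster_of, k):
    """Stage 1 (sketch): group instructions by a cluster id and pick
    round-robin across clusters, so the subset stays diverse."""
    clusters = defaultdict(list)
    for inst in instructions:
        clusters[cluster_of(inst)].append(inst)
    pools = [list(v) for v in clusters.values()]
    selected = []
    while len(selected) < k and any(pools):
        for pool in pools:  # one pick per cluster per round
            if pool and len(selected) < k:
                selected.append(pool.pop())
    return selected


def dynamic_sampling(candidates, score, m):
    """Stage 2 (sketch): keep the m highest-scoring candidates,
    where `score` stands in for a model-based quality signal."""
    return sorted(candidates, key=score, reverse=True)[:m]


# Toy usage: cluster by a length bucket (stand-in for embedding
# clusters) and score by length (stand-in for a quality score).
data = [f"instruction {i} " + "x" * (i % 7) for i in range(50)]
subset = balanced_retrieval(data, cluster_of=lambda s: len(s) % 3, k=12)
final = dynamic_sampling(subset, score=len, m=6)
```

The point of the sketch is the pipeline shape: diversity is enforced first (even coverage of clusters), and quality filtering happens second on the already-diverse pool, so the final subset is small without collapsing onto one region of the data.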