MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Abstract: Data quality and diversity are pivotal in constructing effective instruction-tuning datasets.
With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data.
Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity.
However, lacking a comprehensive view of the entire collection, these methods often yield suboptimal results.
Moreover, heuristic rules generally rely on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in semantic space.
To bridge this gap, we propose a unified method for measuring dataset information. It models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph.
Building on this measurement, we further introduce an efficient sampling method that iteratively selects data samples to Maximize the Information Gain (MIG) in semantic space.
Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods.
Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
This finding shows the potential for unified dataset measurement in guiding instruction data selection.
Code will be available.
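The abstract does not give the measurement formula, so the following is only an illustrative sketch of greedy selection by information gain over a label graph, not the authors' implementation. Everything specific in it is an assumption: the function names `dataset_information` and `greedy_select`, the square-root (concave) scoring of each label's mass, the one-step propagation along graph edges with factor `alpha`, and the representation of each sample as a mapping from label index to a quality-weighted score.

```python
import math

def dataset_information(label_mass, edges, alpha=0.5):
    """Hypothetical information measure over a label graph.

    label_mass: list of non-negative floats, one accumulated score per label.
    edges: dict mapping (i, j) label pairs to an edge weight (undirected).
    alpha: assumed propagation strength between neighboring labels.
    """
    # Assumption: each label also receives a share of its neighbors' mass.
    propagated = list(label_mass)
    for (i, j), weight in edges.items():
        propagated[i] += alpha * weight * label_mass[j]
        propagated[j] += alpha * weight * label_mass[i]
    # Assumption: a concave function gives diminishing returns per label,
    # so mass spread over many labels scores higher than mass piled on one.
    return sum(math.sqrt(m) for m in propagated)

def greedy_select(samples, num_labels, edges, budget, alpha=0.5):
    """Iteratively pick the sample with the largest information gain.

    samples: list of dicts mapping label index -> quality-weighted score.
    Returns the indices of the selected samples, in selection order.
    """
    mass = [0.0] * num_labels
    remaining = list(range(len(samples)))
    selected = []
    for _ in range(min(budget, len(samples))):
        base = dataset_information(mass, edges, alpha)
        best_index, best_gain = None, float("-inf")
        for index in remaining:
            # Tentatively add this sample's label scores to the current mass.
            candidate = list(mass)
            for label, score in samples[index].items():
                candidate[label] += score
            gain = dataset_information(candidate, edges, alpha) - base
            if gain > best_gain:
                best_index, best_gain = index, gain
        # Commit the best sample and update the accumulated label mass.
        selected.append(best_index)
        remaining.remove(best_index)
        for label, score in samples[best_index].items():
            mass[label] += score
    return selected
```

Under these assumptions, a second sample carrying an already well-covered label contributes less gain than a slightly lower-scored sample carrying an uncovered label, which is the quality-versus-diversity trade-off the abstract describes. This naive version recomputes the measure for every candidate at every step; it is meant to show the idea, not the efficiency the paper claims.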
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: data-efficient training
Languages Studied: English, Chinese
Submission Number: 3087