Automatic Instruction Data Selection for Large Language Models via Uncertainty-Aware Influence Maximization

Published: 29 Jan 2025, Last Modified: 29 Jan 2025 | WWW 2025 Poster | CC BY 4.0
Track: Web mining and content analysis
Keywords: Large Language Models, Instruction Tuning, Data Selection
Abstract: Recent years have witnessed the prevalent integration of Large Language Models (LLMs) into various Web applications, such as search engines and recommender systems. As an emerging technique, instruction tuning aims to align pre-trained LLMs into capable chatbots that excel at following human instructions. Previous research indicates that selecting an appropriate subset of a large instruction dataset can enhance the capabilities of LLMs and reduce training costs. However, existing works tend to overlook external correlations between instruction examples during the data selection process, which can introduce bias and lead to sub-optimal performance. To bridge this gap, we formalize this problem from a graph influence maximization perspective and propose Uncertainty-aware influence Maximization (UniMax), a data selection framework that explicitly incorporates the complex inter-dependencies within instruction data. Specifically, we first define a latent instruction graph, treating each instruction example as a graph node and representing the implicit relations between examples as graph edges. Instead of relying solely on heuristic metrics for graph construction, we develop a self-supervised graph learner to uncover the latent structure beyond surface-level feature correlations. We then propose an uncertainty-aware influence function to score each example on the instruction graph, allowing a simple greedy algorithm to select a valuable subset that exhibits both high influence and high uncertainty, with an approximation guarantee. Extensive experiments on public datasets show that the proposed approach significantly enhances model capabilities, underscoring the importance of exploiting data dependencies in instruction data selection.
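The abstract describes a greedy procedure that trades off graph influence against model uncertainty when picking instruction examples. As a rough illustration only, the sketch below implements a generic uncertainty-weighted coverage objective with greedy selection; the function name greedy_unimax_selection, the coverage-style influence surrogate, and the alpha trade-off parameter are assumptions made for illustration and are not taken from the paper, whose actual influence function and graph learner are not specified here.

```python
import numpy as np

def greedy_unimax_selection(adj, uncertainty, budget, alpha=0.5):
    """Greedily pick instruction examples that maximize a monotone score
    combining graph influence (coverage surrogate) and uncertainty.

    adj         : (n, n) array of edge weights of the instruction graph
                  (illustrative stand-in for the learned latent graph)
    uncertainty : (n,) array of per-example uncertainty scores
    budget      : number of examples to select
    alpha       : trade-off between influence spread and uncertainty (assumed)
    """
    n = adj.shape[0]
    selected = []
    covered = np.zeros(n)  # soft "coverage" of each node by the selected set

    def marginal_gain(i):
        # Influence gain: extra coverage node i contributes over its neighbors.
        new_covered = np.maximum(covered, adj[i])
        influence_gain = (new_covered - covered).sum()
        return alpha * influence_gain + (1 - alpha) * uncertainty[i]

    for _ in range(budget):
        gains = [marginal_gain(i) if i not in selected else -np.inf
                 for i in range(n)]
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, adj[best])
    return selected

# Toy usage: 5 examples with a random similarity graph and uncertainty scores.
rng = np.random.default_rng(0)
A = rng.random((5, 5)); np.fill_diagonal(A, 1.0)
u = rng.random(5)
print(greedy_unimax_selection(A, u, budget=2))
```

A coverage-style objective of this form is monotone and submodular, so greedy selection carries the classic (1 - 1/e) approximation guarantee; this is the general flavor of guarantee the abstract alludes to, though the paper's own influence function may differ from this surrogate.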
Submission Number: 223