Abstract: The performance of large language models (LLMs) is strongly influenced by the quality and diversity of the data used for supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect at the expense of the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines a quality metric and a diversity metric multiplicatively. GraphFilter iteratively selects the sentence with the highest priority, removes the covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter with three model backbones on six widely used benchmarks, demonstrating that it outperforms nine existing baselines in both model performance and computational efficiency. Further analysis shows that our design choices lead to more effective subset selection, underscores the value of instruction diversity, and offers insight into how quality and diversity interact across different subset sizes.
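The abstract describes the selection loop concretely enough to sketch in code. Below is a minimal Python sketch of that greedy loop, under several assumptions not stated in the abstract: sentences are whitespace-tokenized, bigrams form the n-gram side of the bipartite graph, diversity is taken to be coverage gain (the count of still-uncovered n-grams a sentence would add), and `quality` is an externally supplied per-sentence score. The names `ngrams` and `graphfilter_select` are hypothetical helpers, not the paper's implementation.

```python
def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list, as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def graphfilter_select(sentences, quality, k, n=2):
    """Greedy quality-diversity selection in the spirit of GraphFilter.

    sentences: list of token lists (one side of the bipartite graph)
    quality:   per-sentence quality scores (assumed given)
    k:         number of sentences to select
    n:         n-gram order (the other side of the bipartite graph)
    """
    # Bipartite graph: sentence index -> set of its n-grams.
    edges = [ngrams(toks, n) for toks in sentences]
    uncovered = set().union(*edges)       # n-grams not yet covered
    remaining = set(range(len(sentences)))
    selected = []

    while remaining and len(selected) < k:
        # Priority = quality x diversity; diversity here is assumed to be
        # the number of still-uncovered n-grams the sentence would cover.
        def priority(i):
            return quality[i] * len(edges[i] & uncovered)

        best = max(remaining, key=priority)
        if priority(best) == 0:           # nothing new left to cover
            break
        selected.append(best)
        remaining.discard(best)
        uncovered -= edges[best]          # remove covered n-grams from the graph
    return selected

if __name__ == "__main__":
    sents = [
        "data selection balances quality and diversity".split(),
        "quality scores rank useful training data".split(),
        "diverse n-grams cover the data landscape".split(),
    ]
    scores = [0.9, 0.7, 0.8]              # e.g., from a quality scorer
    print(graphfilter_select(sents, scores, k=2))  # -> [0, 2]
```

Recomputing priorities after each pick is what makes the multiplicative priority trade quality against marginal coverage, mirroring the classic greedy set-cover heuristic the paper builds on.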
Lay Summary: Training large language models (LLMs) requires selecting the right examples from massive datasets. If we only focus on quality, we might miss out on important diversity; if we only focus on diversity, we might include less useful data. Our work introduces a new method that helps pick examples that are both high-quality and diverse by modeling the data as a network connecting sentences to their key phrases. This approach ensures the selected examples cover a wide range of topics and are useful for training. We show that our method leads to better-performing models and is more efficient than previous techniques. By making data selection smarter, our work can help train language models that are both more accurate and less costly to develop.
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Data Selection
Submission Number: 9170