Abstract:
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by assessing their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a correlation of 0.97 with instruction-tuned model performance, underscoring its value in guiding data engineering practices. Using NovelSum as an optimization objective, we further design a greedy diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric.
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: fine-tuning, data-efficient training, data augmentation
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 7932
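
To make the greedy diversity-oriented selection strategy mentioned in the abstract concrete, here is a minimal Python sketch. It assumes precomputed sample embeddings and uses a hypothetical nearest-neighbor `novelty` score as a simplified stand-in for NovelSum; the function names and the scoring rule are illustrative assumptions, not the paper's actual metric (which also accounts for information density in the sample space).

```python
import numpy as np


def novelty(candidate: np.ndarray, selected: np.ndarray) -> float:
    """Nearest-neighbor distance from a candidate embedding to the
    already-selected set. Simplified stand-in for the sample-level
    "novelty" described in the abstract, not the paper's NovelSum
    definition."""
    if selected.shape[0] == 0:
        return float("inf")  # the first pick is maximally novel by convention
    return float(np.linalg.norm(selected - candidate, axis=1).min())


def greedy_select(embeddings: np.ndarray, budget: int) -> list[int]:
    """Greedy diversity-oriented selection: repeatedly add the sample
    whose novelty with respect to the current selection is largest."""
    selected_idx: list[int] = []
    for _ in range(min(budget, len(embeddings))):
        selected = embeddings[selected_idx]
        scores = [
            -np.inf if i in selected_idx else novelty(embeddings[i], selected)
            for i in range(len(embeddings))
        ]
        selected_idx.append(int(np.argmax(scores)))
    return selected_idx


# Example usage on random embeddings (illustrative only):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(100, 16))
    print(greedy_select(emb, budget=10))
```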