Abstract: Generalist Large Language Models (LLMs) are trained on immense amounts of data drawn from many different domains. However, not all data contribute equally to model performance, and prioritizing data quality can improve domain-specific performance. We suggest that quality is not merely an independent property of a dataset, but rather arises from the manner in which data samples interfere with or complement one another. Furthermore, existing performance metrics for language models are computationally expensive, and they are frequently mathematically ill-defined and poorly suited to generative AI. Toward improving general performance while reducing the amount of training data, and toward quantifying how data contributes to downstream tasks through its relation to other data, we introduce a new metric, Complementarity. We first establish a strong correlation between Complementarity and domain-specific task performance. Because it does not rely on heavy instruction-tuning or text scraping, Complementarity is significantly less expensive to compute and is applicable to a wide variety of potential target domains. Most interestingly, we demonstrate that Complementarity computed over a training validation set is a better predictor of generalization to future test sets than directly measuring performance on a test validation set. Building on this, we introduce an algorithm that carefully selects the data to fine-tune on, yielding a high-performing fine-tuned generalist model that uses only a fraction of the data and requires no data from the test domain. Overall, Complementarity may serve as a key metric in future analyses of data utility and dataset design, and help advance the goal of a truly generalist model.
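To make the abstract's data-selection idea concrete, here is a minimal, hypothetical sketch of score-driven subset selection for fine-tuning. It is not the paper's algorithm: the abstract does not define how Complementarity is computed, so the `complementarity` scoring function below is a placeholder assumption, and the greedy loop is only one generic way such a score could drive selection.

```python
# Hypothetical sketch of score-driven data selection for fine-tuning.
# NOTE: `complementarity` is a placeholder, NOT the paper's metric;
# the abstract does not specify its definition.

from typing import Callable, List, Sequence


def select_subset(
    pool: Sequence[str],
    complementarity: Callable[[str, List[str]], float],
    budget: int,
) -> List[str]:
    """Greedily pick samples whose (placeholder) complementarity score
    with respect to the already-selected set is highest."""
    selected: List[str] = []
    remaining = list(pool)
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda s: complementarity(s, selected))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy scoring rule (assumption only): reward samples that add tokens
    # not yet covered by the selected set.
    def toy_complementarity(sample: str, selected: List[str]) -> float:
        seen = set(" ".join(selected).split())
        return float(len(set(sample.split()) - seen))

    pool = ["the cat sat", "the cat ran", "dogs bark loudly", "birds sing songs"]
    print(select_subset(pool, toy_complementarity, budget=2))
```

In this sketch, only a fraction of the pool (controlled by `budget`) is kept for fine-tuning, mirroring the abstract's goal of high performance from a reduced training set; the actual metric and selection procedure are defined in the paper itself.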
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Gintare_Karolina_Dziugaite1
Submission Number: 4709