Complementarity: Toward Better Metrics and Optimizing Data Efficiency in LLMs

TMLR Paper 4709 Authors

21 Apr 2025 (modified: 10 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Generalist Large Language Models (LLMs) are trained on an immense amount of data drawn from many domains. However, not all data contribute equally to model performance, and prioritizing data quality over quantity can improve domain-specific accuracy. We suggest that quality is not merely an independent property of a dataset, but rather a function of how data samples interfere with or complement one another. Furthermore, existing evaluation metrics are computationally expensive, require extensive design, are mathematically ill-defined, and are generally poorly suited to LLMs. Toward improving general performance while greatly reducing the amount of training data, and toward quantifying how data contribute to downstream tasks through their connections with other data, we introduce a new metric, Complementarity. We first establish a strong correlation between Complementarity and domain-specific task performance. Complementarity is more robust than traditional metrics and significantly less expensive to compute. Moreover, because it does not rely on heavy instruction tuning or text scraping, Complementarity is easier to apply and suits a wide variety of potential target domains. Most interestingly, we demonstrate that Complementarity taken over a training validation set is a better predictor of generalization to future test sets than directly measuring performance on a test validation set. Building on this, we introduce an algorithm that carefully selects the data to fine-tune on, yielding a high-performing fine-tuned generalist model while using only a fraction of the data and without requiring data from the test domain. Overall, Complementarity may serve as a key metric in future analyses of data utility and dataset design, and may prove invaluable in achieving the goal of a truly generalist model.
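The abstract does not define the Complementarity metric itself, so the following is only a minimal sketch of the kind of complementarity-guided data selection it describes: candidate training samples are ranked against a training validation set by a score function and only a small fraction is kept for fine-tuning. The function names and the token-overlap scoring here are hypothetical placeholders, not the paper's method.

```python
# Hypothetical illustration of complementarity-guided data selection.
# `toy_complementarity_score` is a stand-in; the paper's actual metric
# is not specified in the abstract.

from typing import Callable, List, Sequence, Tuple


def select_training_subset(
    candidates: Sequence[str],
    val_set: Sequence[str],
    score_fn: Callable[[str, Sequence[str]], float],
    keep_fraction: float = 0.1,
) -> List[str]:
    """Rank candidates by a score computed against a training validation
    set and keep only the top fraction for fine-tuning."""
    scored: List[Tuple[float, str]] = [
        (score_fn(sample, val_set), sample) for sample in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [sample for _, sample in scored[:n_keep]]


def toy_complementarity_score(sample: str, val_set: Sequence[str]) -> float:
    """Placeholder score: average token overlap with the validation set.
    This is NOT the paper's Complementarity metric."""
    sample_tokens = set(sample.lower().split())
    overlaps = [
        len(sample_tokens & set(v.lower().split())) / max(1, len(sample_tokens))
        for v in val_set
    ]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    pool = [f"training example {i} about topic {i % 5}" for i in range(100)]
    val = ["held-out example about topic 2", "held-out example about topic 4"]
    subset = select_training_subset(pool, val, toy_complementarity_score, 0.1)
    print(f"kept {len(subset)} of {len(pool)} samples")
```

Note that, as claimed in the abstract, the selection above uses only a training validation set; no data from the test domain is consulted.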
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=tmhj8FhY8L&noteId=tmhj8FhY8L
Changes Since Last Submission: The paper previously did not fully conform to the TMLR style file due to import ordering and was therefore desk rejected. This has been corrected and the paper resubmitted.
Assigned Action Editor: ~Gintare_Karolina_Dziugaite1
Submission Number: 4709