Keywords: Data Valuation, Large Language Models, Contribution Estimation
TL;DR: An efficient, accurate, and robust data valuation framework for LLM fine-tuning
Abstract: The training and fine-tuning of large language models (LLMs) rely heavily on large corpora of high-quality data. Nevertheless, data from the internet varies widely in quality, and collecting high-quality data is exceedingly expensive. To facilitate data engineering and trading, the quantification of data value, also known as data valuation, is emerging as a critical topic. Traditional approaches to data valuation typically depend on model retraining. However, with the increasing model sizes and expansive data volumes of LLMs, these conventional methods suffer significant declines in valuation precision, efficiency, and transferability. To alleviate these problems, we propose NESTLE, an efficient and robust framework for data valuation of LLMs. To accurately estimate the distribution of data value across different target domains, we develop a training-free mechanism based on gradient tracing to simulate the influence of data. To further handle dynamic value adjustment when multiple data providers coexist, we draw inspiration from Shapley value theory and devise an accelerated strategy for estimating the marginal contributions of data through gradient additivity. Extensive experiments demonstrate that our proposed framework NESTLE provides accurate and robust estimates of data value at minuscule cost across a wide range of real-world scenarios.
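The gradient-additivity idea from the abstract can be sketched as follows. This is a hypothetical minimal illustration, not NESTLE's actual algorithm: the function names, the use of a dot product between a coalition's summed gradient and a target-domain gradient as the influence proxy, and the exact-Shapley loop are all illustrative assumptions.

```python
# Hypothetical sketch: Shapley-style data valuation under gradient additivity.
# The influence proxy (summed coalition gradient · target-domain gradient) is
# an assumption for illustration, not the paper's exact formulation.
from itertools import combinations
from math import factorial

import numpy as np

def influence(coalition_grads, target_grad):
    """Training-free utility of a coalition: dot product of its summed
    gradient with the target-domain gradient (first-order proxy)."""
    if not coalition_grads:
        return 0.0
    return float(np.sum(coalition_grads, axis=0) @ target_grad)

def shapley_values(provider_grads, target_grad):
    """Exact Shapley values over data providers. Gradient additivity makes
    each coalition's utility a cheap vector sum instead of a retraining run."""
    n = len(provider_grads)
    values = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                base = influence([provider_grads[j] for j in subset],
                                 target_grad)
                with_i = influence([provider_grads[j] for j in subset]
                                   + [provider_grads[i]], target_grad)
                values[i] += weight * (with_i - base)
    return values
```

Note that because this utility is additive in the providers' gradients, each provider's marginal contribution is the same in every coalition, so its Shapley value collapses to its own gradient's dot product with the target gradient. This collapse is precisely why additivity can accelerate marginal-contribution estimation relative to retraining-based valuation.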
Supplementary Material: pdf
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2776