Data Valuation in the Absence of a Reliable Validation Set

Published: 05 Oct 2024, Last Modified: 05 Oct 2024. Accepted by TMLR. License: CC BY 4.0
Abstract: Data valuation plays a pivotal role in ensuring data quality and equitably compensating data contributors. Existing game-theoretic data valuation techniques mostly rely on the availability of a high-quality validation set for their efficacy. However, obtaining a clean validation set drawn from the test distribution may be infeasible in practice. In this work, we show that the choice of validation set can significantly impact the final data value scores. To mitigate this, we introduce a general paradigm that converts a traditional validation-based game-theoretic data valuation method into a validation-free alternative. Specifically, we use the cross-validation error as a surrogate for the model's performance on a validation set. As computing the cross-validation error can be computationally expensive, we propose using the cross-validation error of a kernel regression model as an effective and efficient surrogate for the true performance score on the population. We compare the validation-free variants of existing data valuation techniques with their original validation-based counterparts. Our results indicate that the validation-free variants generally match, and often significantly surpass, the performance of their validation-based counterparts.
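To make the idea of a validation-free utility concrete, the following is a minimal sketch and not the paper's implementation. It assumes kernel ridge regression with an RBF kernel (the abstract only says "kernel regression") and uses the standard closed-form leave-one-out residual for ridge-type smoothers, so no validation set and no refitting are needed. The names `rbf_kernel`, `loo_cv_utility`, and `leave_one_out_values`, as well as the parameters `lam` and `gamma`, are hypothetical and chosen for illustration. The simple leave-one-out data value stands in for a game-theoretic method such as Data Shapley.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    # Pairwise RBF kernel exp(-gamma * ||x - z||^2).
    sq = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def loo_cv_utility(X, y, lam=1e-2, gamma=1.0):
    # Negative leave-one-out mean squared error of kernel ridge regression,
    # computed on the training subset itself (no validation data).
    n = len(y)
    if n < 2:
        return 0.0  # assumption: tiny subsets get a fixed baseline utility
    K = rbf_kernel(X, X, gamma)
    # Hat matrix H = K (K + lam I)^{-1}; K and (K + lam I)^{-1} commute.
    H = np.linalg.solve(K + lam * np.eye(n), K)
    # Closed-form LOO residuals: (y_i - yhat_i) / (1 - H_ii).
    resid = (y - H @ y) / (1.0 - np.diag(H))
    return -float(np.mean(resid**2))

def leave_one_out_values(X, y, lam=1e-2, gamma=1.0):
    # Value of point i = utility of the full set minus utility without i.
    n = len(y)
    full = loo_cv_utility(X, y, lam, gamma)
    values = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        values[i] = full - loo_cv_utility(X[keep], y[keep], lam, gamma)
    return values
```

Under these assumptions, `loo_cv_utility` could replace the validation-set score inside any subset-based valuation routine; the regularization strength and kernel bandwidth would need to be tuned for real data.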
