Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources

Published: 21 Sept 2023, Last Modified: 02 Nov 2023 — NeurIPS 2023 poster
Keywords: data-centric AI, data acquisition, data valuation, performance prediction, data markets, optimal transport, scaling laws
TL;DR: A new framework that predicts model performance when training on data from multiple sources, using only a few pilot samples via Optimal Transport, and that finds the optimal data source composition for any data scale through efficient gradient-based optimization.
Abstract: Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling functions that predict model performance at any *size and data source composition* using the limited available samples. However, these scaling functions are usually black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework called *<projektor>*, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluations over a diverse range of applications (e.g., vision, text, fine-tuning, and noisy data sources) demonstrate that *<projektor>* significantly improves on existing performance scaling approaches, both in the accuracy of performance inference and in the computation costs of constructing the performance predictor. *<projektor>* also outperforms a range of other off-the-shelf solutions in data selection effectiveness by a wide margin. We release *<projektor>* as an open-source toolkit.
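The two-stage inference described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual method: the OT distance here is the exact solution for two equal-size, uniformly weighted point clouds (reduced to an assignment problem), the error values and data sizes are hypothetical, and the power-law form `err(n) = a·n^(-b) + c` is just the generic neural-scaling-law shape the second stage is said to draw on.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, curve_fit

def ot_distance(x, y):
    """Exact OT distance between two equal-size, uniformly weighted
    point clouds: with uniform weights the optimal coupling is a
    permutation, so this reduces to the assignment problem."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# Stage 1 (sketch): compare revealed pilot samples against a validation
# set; the OT distance would feed a learned distance-to-performance map.
rng = np.random.default_rng(0)
pilot = rng.normal(0.0, 1.0, size=(200, 5))       # disclosed pilot data
validation = rng.normal(0.3, 1.0, size=(200, 5))  # task validation data
d = ot_distance(pilot, validation)

# Stage 2 (sketch): extrapolate error beyond disclosed sizes with a
# scaling-law-shaped curve fitted to (hypothetical) pilot-scale errors.
def scaling_law(n, a, b, c):
    return a * np.power(n, -b) + c

sizes = np.array([100, 200, 400, 800])
errors = np.array([0.40, 0.33, 0.28, 0.25])  # hypothetical measurements
params, _ = curve_fit(scaling_law, sizes, errors,
                      p0=(1.0, 0.5, 0.1), maxfev=10000)
projected = scaling_law(10_000, *params)     # error at an undisclosed size
```

In the actual framework the extrapolation is described as parameter-free and the mixture ratio over multiple sources is then optimized by gradient-based methods; the curve fit above only conveys the general size-extrapolation idea.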
Supplementary Material: pdf
Submission Number: 2041