Small-to-Large Generalization: Training Data Influences Models Consistently Across Scale

NeurIPS 2024 Workshop ATTRIB Submission 17 Authors

Published: 30 Oct 2024, Last Modified: 14 Jan 2025. Venue: ATTRIB 2024. License: CC BY 4.0.
Keywords: data attribution
Abstract: Choice of training data distribution greatly affects model behavior. Yet, in large-scale settings, precisely characterizing *how* changes in training data influence predictions is often difficult due to model training costs. Current practice is instead to extrapolate from scaled-down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data influence model behavior across compute scale? We find that the answer is nuanced. Small- and large-scale language model predictions generally *do* correlate highly across choices of training data---often, even when small-model predictions are at the level of random guessing. However, there *also* exist training datasets for which these predictions correlate much less. Equipped with these findings, we characterize how proxy scale affects performance in two downstream proxy-model applications: data attribution and dataset selection.
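The central measurement the abstract describes---how consistently a choice of training data shifts model behavior at small versus large scale---can be sketched as a correlation between per-dataset evaluation metrics. The sketch below is illustrative only: the array values, variable names, and use of Pearson correlation are assumptions for exposition, not the paper's actual data or method.

```python
# Hypothetical sketch: correlating small- and large-model behavior
# across training datasets. All numbers are illustrative, not from the paper.
import numpy as np

# Per-training-dataset metric (e.g., accuracy on a fixed eval set)
# for a small proxy model and a large target model.
small_model_scores = np.array([0.52, 0.61, 0.48, 0.70, 0.55])
large_model_scores = np.array([0.66, 0.74, 0.60, 0.85, 0.69])

# Pearson correlation: how consistently the choice of training data
# shifts the metric across the two compute scales.
corr = np.corrcoef(small_model_scores, large_model_scores)[0, 1]
print(f"small-to-large correlation: {corr:.3f}")
```

A high correlation under this kind of measurement would support using the small model as a proxy for the large one; the abstract notes this often holds, but not for every training dataset.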
Submission Number: 17