Keywords: Experimental design, generalization, data collection
Abstract: Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as for designing more effective data collection policies. We show that there is a simple, accurate way to predict the loss incurred by a model based on data size and composition. Our work expands recent observations of log-linear generalization error and uses them to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach achieves nearly exact ($r^2 > 0.93$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 > 0.83$) on more challenging machine translation and question answering tasks where baselines achieve worse-than-random performance.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=qxrgS3faYz
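
The abstract builds on log-linear generalization error, where the log of the loss falls roughly linearly in the log of the training set size. As a minimal sketch of that baseline idea only (not the rational function approximation derived in the paper, and ignoring data composition), the snippet below fits a log-linear curve to a few small training runs and extrapolates to a larger dataset. The function names, the synthetic sizes and losses, and the constants 5.0 and -0.3 are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def fit_log_linear(sizes, losses):
    """Least-squares fit of log(loss) = intercept + slope * log(n)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return slope, intercept

def predict_loss(n, slope, intercept):
    """Extrapolate the fitted log-linear curve to dataset size n."""
    return np.exp(intercept + slope * np.log(n))

# Synthetic, noise-free "training runs" (hypothetical, for illustration only).
sizes = np.array([1_000, 2_000, 4_000, 8_000], dtype=float)
losses = 5.0 * sizes ** -0.3

slope, intercept = fit_log_linear(sizes, losses)

# Predict the loss at a dataset 8x larger than the largest observed run.
predicted = predict_loss(64_000.0, slope, intercept)
print(f"slope={slope:.3f}, predicted loss at n=64000: {predicted:.4f}")
```

With real training runs the losses are noisy, so the fitted slope and the extrapolated value would carry uncertainty that this sketch does not model.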