Toward Dataset Distillation for Regression Problems

Published: 11 Jun 2025, Last Modified: 10 Jul 2025 · ES-FoMo III · CC BY 4.0
Keywords: dataset distillation, compression, regression, theoretical machine learning, bilevel optimization
TL;DR: Using dataset distillation, we prove that small synthetic datasets can train regression models nearly as well as full datasets.
Abstract: Dataset distillation is an emerging technique that compresses large datasets into smaller synthetic datasets while preserving their learning characteristics. However, it remains under-studied for regression problems. This paper presents a theoretical framework for regression dataset distillation based on bilevel optimization, in which the inner loop optimizes model parameters on the distilled data while the outer loop refines the distilled dataset itself. For regularized linear regression, we derive closed-form solutions and establish approximation guarantees when the number of features exceeds the size of the distilled dataset, using Polyak-Łojasiewicz properties to obtain linear convergence rates. Numerical experiments match our predictions with high coefficients of determination, validating the theory while reducing dataset size by an order of magnitude.
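For readers who want the mechanics, below is a minimal JAX sketch of the bilevel structure the abstract describes: the inner problem (ridge regression on the distilled data) is solved in closed form, and the outer loop refines the synthetic points by gradient descent through that solution. All names, hyperparameters, and the quadratic outer objective are illustrative assumptions, not the authors' implementation.

```python
# Sketch of bilevel dataset distillation for ridge regression (illustrative,
# not the paper's code). Inner loop: closed-form ridge solution on the
# distilled set. Outer loop: gradient descent on the full-data loss with
# respect to the synthetic points themselves.
import jax
import jax.numpy as jnp

k1, k2, k3, k4, k5 = jax.random.split(jax.random.PRNGKey(0), 5)

# d > m: the regime the paper analyzes (more features than distilled points).
n, d, m, lam, lr = 500, 50, 10, 1e-2, 1e-2

# Full dataset: a synthetic linear-regression problem for illustration.
theta_true = jax.random.normal(k1, (d,))
X = jax.random.normal(k2, (n, d))
y = X @ theta_true + 0.1 * jax.random.normal(k3, (n,))

# Distilled dataset: m synthetic pairs, initialized randomly.
Xs = jax.random.normal(k4, (m, d))
ys = jax.random.normal(k5, (m,))

def inner_solution(Xs, ys):
    # Closed-form minimizer of ||Xs t - ys||^2 + lam ||t||^2:
    # (Xs^T Xs + lam I)^{-1} Xs^T ys. Regularization keeps this invertible
    # even though Xs^T Xs has rank at most m < d.
    return jnp.linalg.solve(Xs.T @ Xs + lam * jnp.eye(d), Xs.T @ ys)

def outer_loss(params):
    # Outer objective: how well the model trained on the distilled data
    # fits the full dataset.
    Xs, ys = params
    theta = inner_solution(Xs, ys)
    return 0.5 * jnp.mean((X @ theta - y) ** 2)

# Differentiate through the closed-form inner solve; no unrolling needed.
grad_fn = jax.jit(jax.grad(outer_loss))

params = (Xs, ys)
for _ in range(1000):
    g = grad_fn(params)
    params = jax.tree_util.tree_map(lambda p, gp: p - lr * gp, params, g)

print("full-data loss of model trained on distilled set:",
      outer_loss(params))
```

Because the inner problem is solved exactly by a single d × d linear solve, the outer gradient can be taken directly through the closed form rather than by unrolling inner-loop iterations, which is what makes the regularized linear case theoretically tractable.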
Submission Number: 130