Provable Generalization of Dataset Condensation for Classification via Signal--Noise Dynamics

TMLR Paper9065 Authors

19 May 2026 (modified: 22 May 2026) | Under review for TMLR | Readers: Everyone | CC BY 4.0
Abstract: Dataset condensation, particularly via gradient matching, distills massive datasets into compact synthetic sets, making it an important tool for training under severe storage or computation constraints. However, despite strong empirical performance on classification tasks, existing theory largely relies on regression surrogates or static analyses and gives limited explanation of the underlying classification dynamics. We study gradient-matching condensation for regularized hinge-loss SVMs under an additive sub-Gaussian classification model. Our analysis shows that the learned condensed samples act as signal-concentrating representatives: they aggregate class-level structure while suppressing finite-sample noise and initialization residuals. This mechanism leads to population generalization guarantees for geometry-based evaluators and yields an explicit advantage over random one-shot coresets. The dynamics also identify an early-stopping tradeoff: informative structure is encoded early, whereas overly long inner loops can weaken certified signal accumulation. We further give a multiclass one-condensed-sample-per-class extension through a classwise one-vs-rest update and nearest-prototype evaluation, and simulations on synthetic data and KMNIST corroborate the predicted geometry, schedule sensitivity, and multiclass behavior.
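As a rough illustration of the setting in the abstract, the sketch below compares the gradient of a regularized hinge-loss linear SVM on a real dataset with its gradient on a two-sample synthetic set, one condensed sample per class. It is a minimal toy under stated assumptions, not the paper's algorithm: the Gaussian noise (one instance of sub-Gaussian noise), the signal vector `mu`, the dimensions, the regularization weight `lam`, and the helper names `hinge_grad` and `matching_loss` are all hypothetical choices made for the example. At the zero initialization every sample is margin-active, so the gradients match exactly when the condensed samples are plus and minus the label-weighted mean of the data, which averages out per-sample noise.

```python
import numpy as np

def hinge_grad(w, X, y, lam):
    """Subgradient of lam/2 * ||w||^2 + mean(max(0, 1 - y * <w, x>))."""
    active = (y * (X @ w)) < 1.0              # samples with nonzero hinge loss
    yx = y[active][:, None] * X[active]       # label-weighted active samples
    return lam * w - yx.sum(axis=0) / len(y)

def matching_loss(w, S, y_s, X, y, lam):
    """Squared distance between synthetic-set and real-set gradients at w."""
    diff = hinge_grad(w, S, y_s, lam) - hinge_grad(w, X, y, lam)
    return float(diff @ diff)

rng = np.random.default_rng(0)
d, n, lam = 20, 400, 0.1                      # hypothetical sizes and regularizer
mu = np.zeros(d)
mu[0] = 2.0                                   # hypothetical class signal
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.normal(size=(n, d)) # additive model: signal + noise

w0 = np.zeros(d)                              # all samples are active at w = 0
m = (y[:, None] * X).mean(axis=0)             # label-weighted data mean
S_matched = np.stack([m, -m])                 # one condensed sample per class
y_s = np.array([1.0, -1.0])
S_random = rng.normal(size=(2, d))            # arbitrary synthetic pair

loss_matched = matching_loss(w0, S_matched, y_s, X, y, lam)
loss_random = matching_loss(w0, S_random, y_s, X, y, lam)
noise_in_mean = float(np.linalg.norm(m - mu)) # residual noise after averaging
```

In this toy, `loss_matched` vanishes up to floating-point error while `loss_random` does not, and `noise_in_mean` shrinks as `n` grows, which is one simple way to see condensed samples acting as signal-concentrating representatives.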
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Piyush_Rai1
Submission Number: 9065