SAGE: Streaming, Agreement-driven Gradient Sketches for Representative Subset Selection

Published: 29 Sept 2025, Last Modified: 15 Oct 2025
Venue: NeurIPS 2025 Reliable ML Workshop
License: CC BY 4.0
Keywords: Machine learning, subset selection, training optimization
TL;DR: A streaming, constant-memory subset selection method using a Frequent-Directions gradient sketch and agreement-based scoring that trains nearly as well as full data with far less compute.
Abstract: Training modern neural networks on large datasets is computationally and energy intensive. We present SAGE, a streaming data-subset selection method that maintains a compact Frequent Directions (FD) sketch of gradient geometry in $O(\ell D)$ memory and prioritizes examples whose sketched gradients align with a consensus direction. The approach eliminates $N\times N$ pairwise similarities and explicit $N\times \ell$ gradient stores, yielding a simple two-pass, GPU-friendly pipeline. Leveraging FD's deterministic approximation guarantees, we analyze how agreement scoring preserves gradient energy within the principal sketched subspace. Across multiple benchmarks, SAGE trains with small kept-rate budgets while retaining competitive accuracy relative to full-data training and recent subset-selection baselines, and reduces end-to-end compute and peak memory. Overall, SAGE offers a practical, constant-memory alternative that complements pruning and model compression for efficient training.
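The abstract describes a two-pass pipeline: pass 1 streams per-example gradients into an FD sketch, and pass 2 scores each example by agreement with a consensus direction and keeps a small budget. Below is a minimal NumPy sketch of that structure, under several assumptions not confirmed by the source: per-example gradients are available as dense rows, the consensus direction is taken to be the sketch's top right singular vector (sign-aligned with the mean gradient), and the agreement score is cosine similarity. The names `FDSketch` and `select_subset` are illustrative, not the authors' code.

```python
import numpy as np

class FDSketch:
    """Frequent Directions sketch of streamed gradient rows.

    Maintains a (2*ell, dim) buffer; whenever the buffer fills, an SVD
    shrink step compresses it back to at most ell non-zero rows, so
    memory stays O(ell * dim) no matter how many rows are streamed.
    Assumes dim >= 2 * ell, the typical regime for gradient vectors.
    """

    def __init__(self, ell, dim):
        self.ell = ell
        self.B = np.zeros((2 * ell, dim))
        self.next_row = 0

    def insert(self, g):
        if self.next_row == 2 * self.ell:
            self._shrink()
        self.B[self.next_row] = g
        self.next_row += 1

    def _shrink(self):
        # Standard FD step: subtract the ell-th largest squared singular
        # value from all squared singular values, zeroing the tail rows.
        _, s, Vt = np.linalg.svd(self.B, full_matrices=False)
        s2 = np.maximum(s ** 2 - s[self.ell - 1] ** 2, 0.0)
        self.B = np.sqrt(s2)[:, None] * Vt
        self.next_row = self.ell

    def consensus_direction(self):
        # Top right singular vector of the current sketch; used here as
        # the consensus gradient direction (an assumption on our part).
        _, _, Vt = np.linalg.svd(self.B, full_matrices=False)
        return Vt[0]


def select_subset(grads, keep_rate=0.1, ell=16):
    """Illustrative two-pass selection: build the sketch, then keep the
    top keep_rate fraction of examples by agreement score."""
    grads = np.asarray(grads, dtype=np.float64)
    n, dim = grads.shape
    sketch = FDSketch(ell, dim)
    for g in grads:  # pass 1: stream gradients into the FD sketch
        sketch.insert(g)
    u = sketch.consensus_direction()
    if grads.mean(axis=0) @ u < 0:  # resolve SVD sign ambiguity
        u = -u
    # pass 2: cosine-style agreement with the consensus direction
    scores = grads @ u / (np.linalg.norm(grads, axis=1) + 1e-12)
    k = max(1, int(keep_rate * n))
    return np.argsort(-scores)[:k]
```

In this reading, the $O(\ell D)$ memory claim corresponds to the fixed-size buffer: no $N\times N$ similarity matrix or $N\times \ell$ gradient store is ever materialized, only the sketch plus per-example scores.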
Submission Number: 110