Keywords: optimization, generalization, Locality-Sensitive Hashing, data selection
Abstract: Training contemporary foundation models is becoming an astronomical-scale, compute-limited optimization problem (rather than a generalization problem) in which heterogeneous data arrive in a stream that is prohibitive to store, and a central question is how to spend gradient steps on the more informative data that yield better convergence. We study online data selection as a variance reduction tool for stochastic optimization, and propose a balanced locality-sensitive hashing (LSH) sampler that is one-pass, simple, and lightweight. Our method has linear complexity in the batch size and gradient dimension and is insensitive to hyperparameters, making it a practical choice for streaming, compute-constrained training. Through extensive experiments on image/text classification and fine-tuning Llama 3 on mixed math corpora, we show that our method matches or exceeds the performance of strong diversity and uncertainty baselines with significantly better efficiency. Gradient similarity analyses further confirm that our selected subsets closely approximate full-data gradients, demonstrating both efficiency and effectiveness in diverse online data selection. Our implementation is available at \url{https://anonymous.4open.science/r/LSH-B1F5/README.md}.
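To make the abstract's idea concrete, here is a minimal, hypothetical sketch of balanced LSH-based selection under assumed design choices (SimHash bucketing of per-example feature or gradient vectors, then round-robin draws across buckets); it is not the authors' implementation, and the function names and parameters are illustrative only. It reflects the stated linear cost in batch size and vector dimension.

```python
# Illustrative sketch only (not the paper's code): bucket each example with
# SimHash (signed random projections), then select a per-batch subset by
# drawing evenly across buckets so the kept examples stay diverse.
import numpy as np
from collections import defaultdict

def simhash_buckets(vectors, num_bits=8, rng=None):
    """Assign each row vector to an LSH bucket via signed random projections."""
    rng = np.random.default_rng(0) if rng is None else rng
    planes = rng.standard_normal((vectors.shape[1], num_bits))
    signs = (vectors @ planes > 0).astype(int)            # (n, num_bits) sign bits
    return signs @ (1 << np.arange(num_bits))             # pack bits into bucket ids

def balanced_select(vectors, k, num_bits=8, rng=None):
    """Select k row indices from a streamed batch, spread evenly over LSH buckets."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = min(k, len(vectors))
    buckets = defaultdict(list)
    for idx, b in enumerate(simhash_buckets(vectors, num_bits, rng)):
        buckets[b].append(idx)
    for members in buckets.values():
        rng.shuffle(members)                               # random order within a bucket
    selected = []
    while len(selected) < k:                               # round-robin over buckets
        for members in buckets.values():
            if members and len(selected) < k:
                selected.append(members.pop())
    return selected

# Example usage: keep 32 of 256 streamed examples based on their feature vectors.
batch = np.random.default_rng(1).standard_normal((256, 128))
subset = balanced_select(batch, k=32)
```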
Submission Number: 133