Keywords: optimization, generalization, Locality-Sensitive Hashing, data selection
Abstract: Training contemporary foundation models is becoming an astronomical-scale, compute-limited optimization problem (rather than a generalization problem) in which heterogeneous data arrive in a stream that is prohibitive to store, and a central question is how to spend gradient steps on the more informative data that yield better convergence. We study online data selection as a variance reduction tool for stochastic optimization, and propose a balanced locality-sensitive hashing (LSH) sampler that is one-pass, simple, and lightweight. Our method has linear complexity in the batch size and gradient dimension and is insensitive to hyperparameters, making it a practical choice for streaming, compute-constrained training. Through extensive experiments on image/text classification and fine-tuning Llama 3 on mixed math corpora, we show that our method matches or exceeds the performance of strong diversity and uncertainty baselines with significantly better efficiency. Gradient similarity analyses further confirm that our selected subsets closely approximate full-data gradients, demonstrating both efficiency and effectiveness in diverse online data selection. Our implementation is available at \url{https://anonymous.4open.science/r/LSH-B1F5/README.md}.
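To make the abstract's idea concrete, here is a minimal, hypothetical sketch of balanced LSH-based selection under assumed design choices (SimHash bucketing of per-example feature or gradient vectors, then round-robin draws across buckets); it is not the authors' implementation, and the function names and parameters are illustrative only. It reflects the stated linear cost in batch size and vector dimension.

```python
# Illustrative sketch only (not the paper's code): bucket each example with
# SimHash (signed random projections), then select a per-batch subset by
# drawing evenly across buckets so the kept examples stay diverse.
import numpy as np
from collections import defaultdict

def simhash_buckets(vectors, num_bits=8, rng=None):
    """Assign each row vector to an LSH bucket via signed random projections."""
    rng = np.random.default_rng(0) if rng is None else rng
    planes = rng.standard_normal((vectors.shape[1], num_bits))
    signs = (vectors @ planes > 0).astype(int)            # (n, num_bits) sign bits
    return signs @ (1 << np.arange(num_bits))             # pack bits into bucket ids

def balanced_select(vectors, k, num_bits=8, rng=None):
    """Select k row indices from a streamed batch, spread evenly over LSH buckets."""
    rng = np.random.default_rng(0) if rng is None else rng
    k = min(k, len(vectors))
    buckets = defaultdict(list)
    for idx, b in enumerate(simhash_buckets(vectors, num_bits, rng)):
        buckets[b].append(idx)
    for members in buckets.values():
        rng.shuffle(members)                               # random order within a bucket
    selected = []
    while len(selected) < k:                               # round-robin over buckets
        for members in buckets.values():
            if members and len(selected) < k:
                selected.append(members.pop())
    return selected

# Example usage: keep 32 of 256 streamed examples based on their feature vectors.
batch = np.random.default_rng(1).standard_normal((256, 128))
subset = balanced_select(batch, k=32)
```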
Submission Number: 133