Randomly Pivoted V-optimal Design: Fast Data Selection under Low Intrinsic Dimension

Published: 09 Oct 2024, Last Modified: 19 Nov 2024 · Compression Workshop @ NeurIPS 2024 · CC BY 4.0
Keywords: Data selection, Finetuning, Sketching, Optimal experimental design, V-optimality
Abstract: Despite the ubiquitous high dimensionality brought about by increasing model and data sizes, low intrinsic dimensions are commonly found in many high-dimensional learning problems (e.g., finetuning). To explore sample-efficient learning that leverages such low intrinsic dimensions, we introduce randomly pivoted V-optimal design (RPVopt), a fast data selection algorithm that combines dimension reduction via sketching with optimal experimental design. Given a large dataset with $N$ samples in a high dimension $d$, RPVopt first reduces the dimensionality from $d$ to $m \ll d$ by embedding the data into a random low-dimensional subspace via sketching. A coreset of size $n > m$ is then selected from the low-dimensional sketched data through an efficient two-stage random pivoting algorithm. With a fast embedding matrix for sketching, RPVopt achieves an asymptotic complexity of $O(Nd + Nnm)$, linear in the full data size, data dimension, and coreset size. Through extensive experiments in both regression and classification settings, we demonstrate the empirical effectiveness of RPVopt for data selection in finetuning vision tasks.
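The abstract's pipeline (sketch to $m$ dimensions, then select $n > m$ points by two-stage random pivoting) can be sketched as follows. This is a hedged illustration, not the authors' exact algorithm: stage 1 below uses residual-norm random pivoting in the spirit of randomly pivoted Cholesky, and stage 2 fills the remaining slots by squared-norm sampling; both stand in for the paper's unspecified pivoting rules, and the dense Gaussian map stands in for the fast embedding.

```python
import numpy as np

def rpvopt_coreset(X, m, n, seed=0):
    """Illustrative sketch of an RPVopt-style pipeline (assumptions noted
    in comments; the paper's exact two-stage rule is not reproduced)."""
    assert n > m, "coreset size n should exceed sketch dimension m"
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Sketch: embed the data into a random m-dimensional subspace.
    # A dense Gaussian map costs O(Ndm); a fast embedding (e.g. a
    # subsampled randomized Hadamard transform) reduces this toward
    # the O(Nd) term in the stated complexity.
    S = rng.standard_normal((d, m)) / np.sqrt(m)
    Y = X @ S                                  # (N, m) sketched data

    # Stage 1 (assumed form): pick m pivots with probability
    # proportional to squared residual norms, orthogonalizing the
    # selected directions after each pick.
    resid = np.einsum("ij,ij->i", Y, Y)        # squared row norms
    Q = np.zeros((m, m))                       # orthonormal pivots
    chosen = []
    for t in range(m):
        p = np.clip(resid, 0.0, None)
        p /= p.sum()
        i = rng.choice(N, p=p)
        q = Y[i] - Q[:t].T @ (Q[:t] @ Y[i])    # project out prior pivots
        q /= np.linalg.norm(q)
        Q[t] = q
        resid -= (Y @ q) ** 2                  # update residual norms
        resid[i] = 0.0                         # never re-pick this row
        chosen.append(i)

    # Stage 2 (assumed form): fill the remaining n - m slots by
    # sampling without replacement, weighted by squared sketched norms.
    w = np.einsum("ij,ij->i", Y, Y)
    w[chosen] = 0.0
    extra = rng.choice(N, size=n - m, replace=False, p=w / w.sum())
    return np.concatenate([chosen, extra])
```

The per-pivot update touches all $N$ sketched rows, so the selection loop runs in $O(Nnm)$ overall, matching the second term of the stated complexity under these assumptions.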
Submission Number: 88