Density-Aware Farthest Point Sampling

TMLR Paper5638 Authors

14 Aug 2025 (modified: 01 Sept 2025), under review for TMLR, CC BY 4.0
Abstract: We focus on training machine learning regression models in scenarios where the availability of labeled training data is limited due to computational constraints or high labeling costs. In such scenarios, selecting suitable training sets from unlabeled data is essential for balancing performance and efficiency. For the selection of the training data, we focus on passive and model-agnostic sampling methods that only consider the data feature representations. We derive an upper bound for the expected prediction error of Lipschitz continuous regression models that depends linearly on the weighted fill distance of the training set, a quantity we can estimate from the data features alone. We introduce "Density-Aware Farthest Point Sampling" (DA-FPS), a novel sampling method. We prove that DA-FPS provides approximate minimizers for a data-driven estimate of the weighted fill distance, and thus aims to minimize our derived bound. We conduct experiments using two regression models across three datasets. The results demonstrate that DA-FPS significantly reduces the mean absolute prediction error compared to other sampling strategies.
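For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of classical (unweighted) farthest point sampling and of the plain fill distance of a selected subset. It is not DA-FPS: the abstract does not specify the density weighting, so none is assumed here. The function names `farthest_point_sampling` and `fill_distance`, the Euclidean metric, and the choice of starting index are illustrative assumptions.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Classical greedy FPS (assumed baseline, not DA-FPS).

    Repeatedly adds the point that is farthest from the current selection.
    Assumes 1 <= k <= len(points). Returns the selected indices in order.
    """
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    for _ in range(k - 1):
        # Pick the point with the largest distance to the selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected

def fill_distance(points, selected):
    """Unweighted fill distance: the largest distance from any point
    to its nearest selected point."""
    return max(min(math.dist(p, points[s]) for s in selected)
               for p in points)

# Toy data (hypothetical): five points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.0, 0.0)]
fps_idx = farthest_point_sampling(points, k=3)
print(fps_idx, fill_distance(points, fps_idx))
print(fill_distance(points, [0, 1, 2]))  # an arbitrary subset, for comparison
```

On this toy data the greedy selection spreads over the whole point set and yields a smaller fill distance than the arbitrary subset, which is the property that makes fill-distance-based bounds useful for passive training set selection.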
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yu_Bai1
Submission Number: 5638