Keywords: feature selection, null hypothesis testing, negative result, high-dimensional data, computational biology
TL;DR: Tiny random subsets of features match or outperform feature-selected sets on 27 of 30 high-dimensional datasets, challenging conventional feature selection and underscoring the need for rigorous validation.
Abstract: Feature selection (FS) is assumed to improve predictive performance and to highlight meaningful features. We systematically evaluate this assumption across $30$ diverse datasets, including RNA-Seq, mass spectrometry, and imaging data. Surprisingly, tiny random subsets of features (0.02-1\%) consistently match or outperform both the full feature sets, in $27$ of $30$ datasets, and the selected features reported in published studies (wherever available). In short, any arbitrary set of features is as good as any other, with surprisingly low variance in results; so how can a particular set of selected features be "important" if it performs no better than an arbitrary set? These results indicate a failure to reject the null hypothesis implicit in the claims of many FS papers, challenging the assumption that computationally selected features reliably capture meaningful signals. They also underscore the need for rigorous validation before interpreting selected features as actionable, particularly in computational genomics.
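The comparison at the heart of the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's pipeline: it assumes scikit-learn, a synthetic high-dimensional dataset in place of the $30$ real ones, univariate F-test selection as a stand-in FS method, and logistic regression with cross-validated AUC as the evaluation.

```python
# Minimal sketch (not the paper's pipeline): compare a univariate-FS
# baseline against arbitrary random feature subsets of the same size.
# Assumed stand-ins: scikit-learn, synthetic data, F-test selection,
# logistic regression, 5-fold cross-validated ROC AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10_000,
                           n_informative=50, random_state=0)
k = max(1, int(0.001 * X.shape[1]))  # a "tiny" subset: 0.1% of features

def cv_auc(features):
    """Cross-validated AUC of a standard classifier on the given columns."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
    return cross_val_score(clf, X[:, features], y, cv=5,
                           scoring="roc_auc").mean()

# FS baseline: the top-k features by univariate F-test.
fs_idx = SelectKBest(f_classif, k=k).fit(X, y).get_support(indices=True)

# Random baseline: repeat with arbitrary k-feature subsets.
rand = [cv_auc(rng.choice(X.shape[1], size=k, replace=False))
        for _ in range(10)]

print(f"F-test selected (k={k}): AUC = {cv_auc(fs_idx):.3f}")
print(f"random subsets  (k={k}): AUC = {np.mean(rand):.3f} "
      f"+/- {np.std(rand):.3f}")
```

Note that fitting the selector on all data before cross-validation leaks label information and, if anything, flatters the FS baseline; random subsets matching it anyway is what the abstract's null-hypothesis argument turns on.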
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 25529