Feature Selection for Huge Data via Minipatch Learning

TMLR Paper877 Authors

19 Feb 2023 (modified: 17 Sept 2024) · Withdrawn by Authors · CC BY 4.0
Abstract: Feature selection often leads to increased model interpretability, faster computation, and improved model performance by discarding irrelevant or redundant features. While feature selection is a well-studied problem with many widely-used techniques, there are typically two key challenges: i) many existing approaches become computationally intractable in huge-data settings on the order of millions of features; and ii) the statistical accuracy of selected features often degrades in high-dimensional, high-noise, and high-correlation settings, thus hindering reliable model interpretation. In this work, we tackle these problems by developing Stable Minipatch Selection (STAMPS) and Adaptive STAMPS (AdaSTAMPS). These are meta-algorithms that build ensembles of selection events of base feature selectors trained on many tiny, randomly or adaptively chosen subsets of both the observations and features of the data, termed minipatches. Our approaches are general and can be employed with a variety of existing feature selection strategies and machine learning techniques in practice. In addition, we empirically demonstrate that our approaches, especially AdaSTAMPS, outperform many competing methods in terms of feature selection accuracy and computational time in a variety of numerical experiments; we also show the efficacy of our methods in challenging high-dimensional settings common in biological data. Our methods are implemented in the Python package minipatch-learning.
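To make the random-minipatch idea concrete, here is a minimal sketch of STAMPS-style stable selection: repeatedly draw tiny row-and-column subsamples, run a base selector on each, and keep features whose selection frequency exceeds a threshold. This is an illustration under stated assumptions, not the authors' implementation or the minipatch-learning package API; the lasso base selector and all parameter names (n_patches, n_obs, n_feats, freq_threshold) are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def minipatch_selection(X, y, n_patches=1000, n_obs=50, n_feats=20,
                        alpha=0.1, freq_threshold=0.5, seed=0):
    """Ensemble selection events of a base selector (lasso here)
    over many tiny random minipatches (row/column subsamples)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = np.zeros(p)  # times feature j was selected
    sampled = np.zeros(p)   # times feature j appeared in a minipatch

    for _ in range(n_patches):
        rows = rng.choice(n, size=n_obs, replace=False)
        cols = rng.choice(p, size=n_feats, replace=False)
        # Fit the base selector on the tiny minipatch only.
        model = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
        sampled[cols] += 1
        selected[cols[np.abs(model.coef_) > 1e-8]] += 1

    # Selection frequency: fraction of appearances in which the
    # feature was selected; stable features exceed the threshold.
    freq = np.where(sampled > 0, selected / np.maximum(sampled, 1), 0.0)
    return np.flatnonzero(freq >= freq_threshold), freq
```

Because each fit touches only an n_obs-by-n_feats submatrix, the per-patch cost is independent of the full data size, which is the source of the scalability the abstract describes; the adaptive variant (AdaSTAMPS) would replace the uniform column sampling with a scheme that favors features with high running selection frequency.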
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lijun_Zhang1
Submission Number: 877