A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

Valeriia Cherepanova; Roman Levin; Gowthami Somepalli; Jonas Geiping; C. Bayan Bruss; Andrew Gordon Wilson; Tom Goldstein; Micah Goldblum

Published: 26 Sept 2023, Last Modified: 15 Jan 2024

Venue: NeurIPS 2023 Datasets and Benchmarks Poster

Keywords: tabular deep learning, tabular data, feature selection, deep lasso, lasso

TL;DR: We construct a challenging feature selection benchmark evaluated on downstream tabular deep learning models and propose an input-gradient-based analogue of LASSO for neural networks.

Abstract: Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent overfitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, rely on toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose Deep Lasso -- an input-gradient-based analogue of LASSO for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.
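The abstract mentions several ways of generating extraneous features and names corrupted and second-order features, but this page does not spell out how they are constructed. The following is a minimal sketch of how such features might be appended to a real-valued feature matrix. It is not the benchmark's actual procedure. The function name `add_extraneous_features`, the `noise_scale` parameter, the use of Gaussian noise, and the choice of random column pairs for products are all assumptions made for illustration.

```python
import numpy as np


def add_extraneous_features(X, n_extra, kind="random", noise_scale=0.5, seed=0):
    """Append n_extra extraneous columns to X of shape (n_samples, n_features).

    Hypothetical helper, not taken from the paper's code.
    Returns the augmented matrix and a boolean mask marking original columns.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if kind == "random":
        # Pure Gaussian noise, independent of the data.
        extra = rng.standard_normal((n, n_extra))
    elif kind == "corrupted":
        # Noisy copies of randomly chosen original columns (assumed scheme);
        # noise is scaled by each copied column's standard deviation.
        cols = rng.integers(0, d, size=n_extra)
        noise = rng.standard_normal((n, n_extra)) * X[:, cols].std(axis=0)
        extra = X[:, cols] + noise_scale * noise
    elif kind == "second_order":
        # Products of randomly chosen pairs of original columns (assumed scheme).
        a = rng.integers(0, d, size=n_extra)
        b = rng.integers(0, d, size=n_extra)
        extra = X[:, a] * X[:, b]
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    X_aug = np.hstack([X, extra])
    is_original = np.arange(d + n_extra) < d
    return X_aug, is_original
```

A feature selector could then be run on `X_aug`, with `is_original` used to check how many of the selected columns are original features, before a downstream model is trained on the selected subset.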

URL: https://github.com/vcherepanova/tabular-feature-selection

Submission Number: 668
