Soft Gates for Sharp Experts in Tabular Representation Learning

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: supervised learning, tabular representation learning, feature attribution
TL;DR: Hard gating kills tabular experts (accuracy 4 pp below a dense baseline); soft, entropy-regularized routing achieves extreme sparsity and the best accuracy by decoupling output sparsity from gradient flow.
Abstract: Neural networks consistently underperform gradient-boosted trees on tabular data, yet the structural reasons remain poorly understood. We design the Sparse Feature Routing Network (SFR Net), not as a benchmark entry but as an experimental apparatus, to test three hypotheses about tabular inductive biases: (H1) per-feature experts improve over shared encoders even with fewer parameters, with gains amplified by instance-wise routing; (H2) instance-wise sparsity helps only when differentiable, since hard gating collapses optimization; (H3) the learned routing produces faithful attributions, confirmed by deletion tests against random baselines. The most striking finding: hard sparsity degrades accuracy below the dense baseline, while entropy-regularized softmax routing achieves extreme sparsity (2.9 of 14 effective features) and the highest accuracy. Soft gates produce sharp experts; hard gates produce dead ones. Controlled ablations, together with comparisons against 12 baselines across 13 benchmarks, yield testable design principles for tabular architectures.
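Since the paper's code is not shown on this page, the following is a minimal PyTorch sketch of the mechanism the abstract describes: per-feature experts combined by a softmax gate whose entropy is penalized, so the output mixture becomes sparse while gradients still flow to every expert. All module names, layer sizes, and the penalty weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftFeatureRouter(nn.Module):
    """Sketch of soft, entropy-regularized feature routing (names/sizes assumed)."""

    def __init__(self, n_features: int = 14, d_expert: int = 16, d_out: int = 1):
        super().__init__()
        # One small expert per input feature (H1): each maps a scalar to an embedding.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(1, d_expert), nn.ReLU(), nn.Linear(d_expert, d_expert))
            for _ in range(n_features)
        )
        # Instance-wise router: one gate logit per feature, conditioned on the input.
        self.router = nn.Linear(n_features, n_features)
        self.head = nn.Linear(d_expert, d_out)

    def forward(self, x: torch.Tensor):
        # x: (batch, n_features)
        gates = F.softmax(self.router(x), dim=-1)  # soft, fully differentiable gates
        expert_out = torch.stack(
            [expert(x[:, i : i + 1]) for i, expert in enumerate(self.experts)],
            dim=1,
        )  # (batch, n_features, d_expert)
        mixed = (gates.unsqueeze(-1) * expert_out).sum(dim=1)  # gate-weighted mixture
        return self.head(mixed), gates

def entropy_penalty(gates: torch.Tensor) -> torch.Tensor:
    # Adding this term to the loss pushes gate entropy down, concentrating mass
    # on few features, while every expert still receives gradient (unlike a
    # hard top-k mask, which per H2 collapses optimization).
    return -(gates * gates.clamp_min(1e-12).log()).sum(dim=-1).mean()

def effective_feature_count(gates: torch.Tensor) -> torch.Tensor:
    # One plausible reading of the abstract's "effective features" is the gate
    # perplexity exp(H); this definition is an assumption, not the paper's.
    H = -(gates * gates.clamp_min(1e-12).log()).sum(dim=-1)
    return H.exp().mean()

# Usage: total loss = task loss + lambda * entropy penalty (lambda assumed).
model = SoftFeatureRouter()
x, y = torch.randn(32, 14), torch.randn(32, 1)
pred, gates = model(x)
loss = F.mse_loss(pred, y) + 0.1 * entropy_penalty(gates)
loss.backward()
```

The design choice the abstract highlights is visible here: the softmax gate stays dense in the backward pass, so the entropy term, rather than a hard top-k mask, supplies the sparsity pressure; a hard gate would zero gradients to unselected experts and produce the "dead" experts the abstract reports.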
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 73