## Feature Selection using Stochastic Gates

Sep 25, 2019 Blind Submission readers: everyone Show Bibtex
• Keywords: Feature selection, classification, regression, survival analysis
• TL;DR: We develop a fully embedded feature selection method based directly on approximating the $\ell_0$ penalty.
• Abstract: Feature selection problems have been extensively studied in the setting of linear estimation, for instance LASSO, but less emphasis has been placed on feature selection for non-linear functions. In this study, we propose a method for feature selection in high-dimensional non-linear function estimation problems. The new procedure is based on directly penalizing the $\ell_0$ norm of features, or the count of the number of selected features. Our $\ell_0$ based regularization relies on a continuous relaxation of the Bernoulli distribution, which allows our model to learn the parameters of the approximate Bernoulli distributions via gradient descent. The proposed framework simultaneously learns a non-linear regression or classification function while selecting a small subset of features. We provide an information-theoretic justification for incorporating Bernoulli distribution into our approach. Furthermore, we evaluate our method using synthetic and real-life data and demonstrate that our approach outperforms other embedded methods in terms of predictive performance and feature selection.