- TL;DR: We develop a simple regression-based model-agnostic feature selection method to interpret data generating processes with FDR control, and outperform several popular baselines on several simulated, medical, and image datasets.
- Abstract: Answering questions about data can require understanding what parts of an input X influence the response Y. Finding such an understanding can be built by testing relationships between variables through a machine learning model. For example, conditional randomization tests help determine whether a variable relates to the response given the rest of the variables. However, randomization tests require users to specify test statistics. We formalize a class of proper test statistics that are guaranteed to select a feature when it provides information about the response even when the rest of the features are known. We show that f-divergences provide a broad class of proper test statistics. In the class of f-divergences, the KL-divergence yields an easy-to-compute proper test statistic that relates to the AMI. Questions of feature importance can be asked at the level of an individual sample. We show that estimators from the same AMI test can also be used to find important features in a particular instance. We provide an example to show that perfect predictive models are insufficient for instance-wise feature selection. We evaluate our method on several simulation experiments, on a genomic dataset, a clinical dataset for hospital readmission, and on a subset of classes in ImageNet. Our method outperforms several baselines in various simulated datasets, is able to identify biologically significant genes, can select the most important predictors of a hospital readmission event, and is able to identify distinguishing features in an image-classification task.
- Code: https://drive.google.com/file/d/13ShNmzQV-rI1DQN2AXpFrNAUKe7rmC0l/view?usp=sharing
- Keywords: feature selection, interpretability, randomization, fdr control, p-values